2025-05-22-12-06

COSMIC: Enabling Full-Stack Co-Design and Optimization of Distributed Machine Learning Systems

Abstract

arXiv:2505.15020v1 Announce Type: new Abstract: Large-scale machine learning models necessitate distributed systems, posing significant design challenges due to the large parameter space across distinct design stacks. Existing studies often focus on optimizing individual system aspects in isolation. This work challenges this limitation and introduces COSMIC, a full-stack distributed machine learning systems environment enabling end-to-end simulation and agent-based design space exploration. To facilitate efficient exploration and optimization across the entire stack, we introduce Parameter Set Architecture, an abstraction concept analogous to the instruction set architecture, which abstracts away the configuration complexities of agent-based search methods. Case studies demonstrate COSMIC's ability to consolidate parameters across multiple layers of design abstraction, discovering eight non-obvious high-performance system configurations across four transformer-based models with up to 175 billion parameters. By optimizing across the stack, COSMIC delivers 1.50-48.41x higher performance than isolated single-stack optimization.



Balanced and Elastic End-to-end Training of Dynamic LLMs

Abstract

arXiv:2505.14864v1 Announce Type: new Abstract: To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when using pipeline parallelism in training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs). DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.



FOL-Pretrain: A complexity annotated corpus of first-order logic

Abstract

arXiv:2505.14932v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities, ranging from coding and solving mathematical problems to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data -- vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.



Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge

Abstract

arXiv:2505.15240v1 Announce Type: new Abstract: This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve system efficiency, reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.



When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning

Abstract

arXiv:2505.15276v1 Announce Type: new Abstract: Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehensive analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we uncover key factors influencing the reasoning behaviors. We further find that NT reduces output length at the cost of accuracy, while ET and IT maintain accuracy with reduced response length. Our findings expose fundamental inconsistencies in RL-optimized LRMs, necessitating adaptive improvements for reliable efficiency.



ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

Abstract

arXiv:2505.15068v1 Announce Type: new Abstract: Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.



Reinforcement Learning from User Feedback

Abstract

arXiv:2505.14946v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: it is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.



Self-Evolving Curriculum for LLM Reasoning

Abstract

arXiv:2505.14970v1 Announce Type: new Abstract: Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
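
The curriculum loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The class name `CurriculumBandit`, the learning rate `lr`, the `temperature` parameter, and the softmax sampling rule are all assumptions made for illustration, since the abstract only states that categories are arms, that the reward is the absolute advantage, and that the update is TD(0). In a stateless bandit, TD(0) reduces to moving each arm's value toward the latest reward.

```python
import math
import random


class CurriculumBandit:
    """Hypothetical curriculum policy: one arm per problem category."""

    def __init__(self, categories, lr=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # estimated learning gain per arm
        self.lr = lr                            # TD(0) step size (assumed value)
        self.temperature = temperature          # softmax temperature (assumption)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over arm values; the sampling rule is an assumption,
        # the abstract only says categories are chosen to maximize reward.
        m = max(self.q.values())
        exps = {c: math.exp((v - m) / self.temperature) for c, v in self.q.items()}
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    def sample(self):
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        # Reward proxy: mean absolute advantage of the rollouts in this category.
        reward = sum(abs(a) for a in advantages) / len(advantages)
        # TD(0) with no successor state: Q <- Q + lr * (reward - Q).
        self.q[category] += self.lr * (reward - self.q[category])
        return reward


bandit = CurriculumBandit(["easy", "hard"])
gain = bandit.update("hard", [1.0, -1.0, 0.0, 2.0])
next_category = bandit.sample()
```

Because the arm values are updated with a constant step size, older rewards decay over time, which is one simple way to cope with the non-stationarity the abstract mentions.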



When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning

Abstract

arXiv:2505.15400v1 Announce Type: new Abstract: Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of "Internal Self-Recovery Mechanism" where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.



lmgame-Bench: How Good are LLMs at Playing Games?

Abstract

arXiv:2505.15146v1 Announce Type: new Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
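
To make the phrase "Gym-style API" concrete, the following sketch shows the reset/step interaction loop with a simple text memory passed to the policy. It is not the lmgame-Bench API. The toy game `CountdownGame`, the helper `run_episode`, and the observation format are hypothetical, and the loop only mirrors the five-value `step` return convention used by Gymnasium environments.

```python
class CountdownGame:
    """Hypothetical toy game: reach zero by subtracting 1 or 2 per move."""

    def __init__(self, start=5):
        self.start = start
        self.value = start

    def reset(self):
        # Returns (observation, info), following the Gym-style convention.
        self.value = self.start
        return {"text": f"value={self.value}"}, {}

    def step(self, action):
        # Returns (observation, reward, terminated, truncated, info).
        self.value = max(0, self.value - action)
        terminated = self.value == 0
        reward = 1.0 if terminated else 0.0
        return {"text": f"value={self.value}"}, reward, terminated, False, {}


def run_episode(env, policy, max_steps=20):
    """Run one episode; `memory` is a plain list standing in for a memory scaffold."""
    obs, _ = env.reset()
    memory = []
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs, memory)           # an LLM call would go here
        memory.append((obs["text"], action))   # record what was seen and done
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward, memory


score, history = run_episode(CountdownGame(start=5), lambda obs, mem: 2)
```

Keeping every game behind the same two methods is what lets a single agent loop be reused across games; the policy here is a constant function only to keep the example runnable.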



ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLMs

Abstract

arXiv:2505.15410v1 Announce Type: new Abstract: Clickstream data from digital learning environments offer valuable insights into students' learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, therefore often lacking generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.



THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

Abstract

arXiv:2505.11626v1 Announce Type: cross Abstract: We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval-Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.



Toward Open Earth Science as Fast and Accessible as Natural Language

Abstract

arXiv:2505.15690v1 Announce Type: new Abstract: Is natural-language-driven earth observation data analysis now feasible with the assistance of Large Language Models (LLMs)? For open science in service of public interest, feasibility requires reliably high accuracy, interactive latencies, low (sustainable) costs, open LLMs, and openly maintainable software -- hence, the challenge. What are the techniques and programming system requirements necessary for satisfying these constraints, and what is the corresponding development and maintenance burden in practice? This study lays the groundwork for exploring these questions, introducing an impactful earth science use-case, and providing a software framework with evaluation data and metrics, along with initial results from employing model scaling, prompt-optimization, and inference-time scaling optimization techniques. While we attain high accuracy (near 100%) across 10 of 11 metrics, the analysis further considers cost (token-spend), latency, and maintainability across this space of techniques. Finally, we enumerate opportunities for further research, general programming and evaluation framework development, and ongoing work for a comprehensive, deployable solution. This is a call for collaboration and contribution.



The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute

Abstract

arXiv:2505.14733v1 Announce Type: cross Abstract: Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work introduces test-time compute (TTC), the allocation of additional computational resources during inference, as a compelling complement to conventional scaling strategies. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models without incurring additional pretraining costs.



LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization

Abstract

arXiv:2505.14756v1 Announce Type: cross Abstract: Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.



Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Abstract

arXiv:2505.14810v1 Announce Type: cross Abstract: Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.



Text Generation Beyond Discrete Token Sampling

Abstract

arXiv:2505.14827v1 Announce Type: cross Abstract: In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
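
The blending step can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact estimator. The abstract only says that the token distribution is the prior, the sampled token is the observation, and the posterior expectation replaces the one-hot input. The Dirichlet-style weighting used here, the parameters `prior_strength` and `obs_weight`, and both function names are hypothetical.

```python
def posterior_mixture(probs, sampled_id, prior_strength=1.0, obs_weight=1.0):
    """Posterior-mean weights over the vocabulary.

    Assumed model: pseudo-counts prior_strength * probs (the prior) plus
    obs_weight on the sampled token (the observation), then normalized.
    """
    total = prior_strength + obs_weight
    return [
        (prior_strength * p + (obs_weight if i == sampled_id else 0.0)) / total
        for i, p in enumerate(probs)
    ]


def mixed_input_embedding(weights, embedding_table):
    """Weighted average of token embeddings, used in place of a one-hot lookup."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of three tokens with two-dimensional embeddings.
next_token_probs = [0.5, 0.3, 0.2]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = posterior_mixture(next_token_probs, sampled_id=1)
next_input = mixed_input_embedding(weights, embeddings)
```

With `obs_weight` much larger than `prior_strength`, the weights approach the conventional one-hot vector, so standard generation is recovered as a limiting case of this sketch.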



Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis

Abstract

arXiv:2505.14742v1 Announce Type: cross Abstract: Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers, a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff's efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at https://github.com/Little0o0/Quaff.git.



A Comparative Study of Large Language Models and Human Personality Traits

Abstract

arXiv:2505.14845v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation, becoming active participants in social and cognitive domains. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality, focusing on the applicability of conventional personality assessment tools. A behavior-based approach was used across three empirical studies. Study 1 examined test-retest stability and found that LLMs show higher variability and are more input-sensitive than humans, lacking long-term stability. Based on this, we propose the Distributed Personality Framework, conceptualizing LLM traits as dynamic and input-driven. Study 2 analyzed cross-variant consistency in personality measures and found LLMs' responses were highly sensitive to item wording, showing low internal consistency compared to humans. Study 3 explored personality retention during role-playing, showing LLM traits are shaped by prompt and parameter settings. These findings suggest that LLMs express fluid, externally dependent personality patterns, offering insights for constructing LLM-specific personality frameworks and advancing human-AI interaction. This work contributes to responsible AI development and extends the boundaries of personality psychology in the age of intelligent systems.

摘要

大语言模型(LLMs)在语言理解与生成方面展现出类人能力,已成为社会和认知领域的积极参与者。本研究探讨LLMs是否表现出类人格特质,以及这些特质与人类人格的异同,重点关注传统人格评估工具的适用性。通过三项实证研究采用基于行为的方法:研究1检验了重测稳定性,发现LLMs较人类表现出更高变异性和输入敏感性,缺乏长期稳定性。据此我们提出分布式人格框架,将LLM特质概念化为动态且输入驱动的属性。研究2分析了人格测量的跨版本一致性,发现LLMs的响应高度依赖条目措辞,与人类相比内部一致性较低。研究3探究角色扮演中的人格保持性,显示LLM特质受提示语和参数设置塑造。这些发现表明LLMs表达出流动的、外部依赖性的人格模式,为构建LLM专用人格框架和推进人机交互提供了新见解。本工作有助于负责任AI发展,并拓展了智能系统时代人格心理学的边界。


MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

Abstract

arXiv:2505.14848v1 Announce Type: cross Abstract: We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.

摘要

我们提出MAATS(多智能体自动翻译系统),该系统利用多维质量指标(MQM)框架作为细粒度信号进行错误检测与优化。MAATS采用多个专用AI智能体,每个智能体专注于特定MQM类别(如准确性、流畅性、风格、术语),再由合成智能体整合标注以迭代优化翻译。这种设计与依赖自我校正的传统单智能体方法形成鲜明对比。

通过在多语言对和大语言模型(LLM)上的评估,MAATS在自动指标和人工评估中均显著优于零样本和单智能体基线,尤其在语义准确性、地域适应性及语言距离较远的语对中表现突出。定性分析表明其优势体现在多层错误诊断、多视角遗漏检测及上下文感知优化方面。通过将模块化智能体角色与可解释的MQM维度对齐,MAATS缩小了黑盒LLM与人工翻译流程间的差距,将优化重点从表层流畅性转向更深层的语义与上下文保真度。


WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Abstract

arXiv:2505.14818v1 Announce Type: cross Abstract: Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.

摘要

稳健评估大语言模型(LLMs)的长篇叙事能力仍存在重大挑战,现有基准测试往往缺乏必要的规模、多样性或客观衡量标准。为此,我们提出WebNovelBench——一个专为评估长篇小说生成而设计的新型基准。该基准利用包含4,000余部中文网络小说的大规模数据集,将评估任务构建为'概要到故事'的生成框架。我们提出一个包含八个叙事质量维度的多层面评估体系,通过'LLM即评委'方法实现自动化测评,并采用主成分分析法聚合分数后映射至人类作品的百分位排名。实验表明,WebNovelBench能有效区分人类创作的经典作品、流行网络小说与LLM生成内容。我们对24个前沿LLM进行了全面分析,排序其叙事能力并为未来发展提供洞见。该基准为评估和推进LLM驱动的叙事生成提供了可扩展、可复现且数据驱动的方法论。


Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

Abstract

arXiv:2505.14884v1 Announce Type: cross Abstract: Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to 2.2× end-to-end speedups for models like OPT, LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.

摘要

加速大型语言模型(LLM)推理对于需要高吞吐量和低延迟的实际部署至关重要。上下文稀疏性(即每个令牌动态激活仅一小部分模型参数)虽展现出潜力,但由于活跃神经元的并集迅速逼近密集计算,该方法难以扩展至大批量场景。我们提出极性稀疏性,揭示了当批量大小与序列长度增加时,稀疏性重要性从MLP层向注意力层的关键转变:MLP层在批处理下计算效率提升但其稀疏性消失,而注意力计算成本随规模增长显著增加,其头部稀疏性却保持稳定且与批量无关。我们开发了硬件高效的稀疏感知GPU内核,用于选择性MLP和注意力计算,在保持精度前提下为OPT、LLaMA-2/3等模型在不同批量与序列长度下带来最高2.2倍的端到端加速。据我们所知,这是首个证明上下文稀疏性可有效扩展至大批量的研究,通过极简修改实现显著推理加速,使极性稀疏性适用于大规模高吞吐LLM部署系统。代码已开源:https://github.com/susavlsh10/Polar-Sparsity。
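
The claim that the union of active neurons quickly approaches dense computation can be illustrated with a back-of-envelope model. The sketch below assumes, purely for illustration, that each token independently activates each MLP neuron with probability `p`; this independence assumption and the function name are not taken from the paper. Under that assumption the expected fraction of neurons touched by a batch of size `B` is 1 - (1 - p)^B.

```python
def expected_union_fraction(p: float, batch_size: int) -> float:
    """Expected fraction of neurons active for at least one token in a batch.

    Assumes (hypothetically) that each token activates each neuron
    independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return 1.0 - (1.0 - p) ** batch_size


# With 10% per-token sparsity, the union fills up rapidly as the batch grows.
for b in (1, 8, 64, 256):
    print(b, round(expected_union_fraction(0.1, b), 4))
```

Under this toy model, per-token MLP sparsity gives little benefit at large batch sizes, which is consistent with the abstract's argument for shifting attention to head-level sparsity.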


Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

Abstract

arXiv:2505.14943v1 Announce Type: cross Abstract: To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.

摘要

为帮助评估和理解语言模型的潜在能力,本文提出一种采用优化输入嵌入(即"软提示")作为模型与目标行为间条件距离度量指标的方法。该技术旨在将潜在能力发现作为自动化红队测试/评估套件的组成部分,并通过可量化的反馈机制评估潜在风险行为的可及性,这种方法未来可扩展至更强大的模型(包括那些可能具备欺骗性对齐能力的模型)。研究通过自然语言处理、国际象棋和路径规划三个领域展示了基于软提示的评估框架,并进一步提出广义条件软提示技术以辅助构建任务评估体系。


Scaling Laws for State Dynamics in Large Language Models

Abstract

arXiv:2505.14892v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.

摘要

大型语言模型(LLMs)在需要内部状态追踪的任务中应用日益广泛,但其对状态转移动态的建模能力仍不甚明晰。本研究评估了LLMs在三个可形式化为有限状态系统的领域(方块追踪、抽象DFA序列和复杂文本游戏)中捕捉确定性状态动态的能力。实验发现:跨任务场景下,下一状态预测准确率随状态空间规模扩大和转移稀疏性增加而下降。GPT-2 XL在低复杂度环境下可达约70%准确率,但当方块数量或状态数分别超过5或10时,准确率降至30%以下;在DFA任务中,当状态数>10且转移数<30时,Pythia-1B模型准确率无法突破50%。通过激活修补技术,我们识别出负责状态信息传播的注意力头:GPT-2 XL第22层第20号头,以及Pythia-1B第10、11、12和14层的注意力头。虽然这些头能成功传递相关状态特征,但动作信息未被可靠路由至最终标记,表明联合状态-动作推理能力薄弱。研究结果表明,LLMs中的状态追踪源于下一标记头的分布式交互,而非显式的符号计算。
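
To make the next-state prediction setting concrete, here is a minimal sketch of a deterministic finite automaton of the kind the DFA task formalizes. The three-state transition table and the action names are hypothetical and are not drawn from the paper's benchmarks; a model being evaluated would have to predict the value that `run_dfa` returns from the textual action sequence alone.

```python
from typing import Dict, Iterable, Tuple

Transitions = Dict[Tuple[str, str], str]


def run_dfa(transitions: Transitions, start: str, actions: Iterable[str]) -> str:
    """Apply each action in order and return the resulting state."""
    state = start
    for action in actions:
        # A deterministic system has exactly one successor per (state, action).
        state = transitions[(state, action)]
    return state


# Hypothetical toy automaton with three states and two actions.
toy_transitions: Transitions = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q0", ("q2", "b"): "q2",
}

final_state = run_dfa(toy_transitions, "q0", ["a", "b", "b", "a"])
print(final_state)
```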


Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels

Abstract

arXiv:2505.14925v1 Announce Type: cross Abstract: Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.

摘要

尽管大型语言模型(LLMs)的上下文长度已扩展至数百万标记,但评估其在"大海捞针"式测试之外的有效性仍具挑战。我们认为小说可作为研究复杂精细结构及长程语义依赖(通常超过128k标记)的理想案例。受计算小说分析研究的启发,我们发布了"太长未建模"(TLDM)基准测试,用于评估模型在情节摘要复述、故事世界构型识别及叙事时间跨度推算方面的能力。测试发现,七款前沿LLMs在超过64k标记后均无法保持稳定理解。结果表明,语言模型开发者必须超越"中间丢失"类基准,才能准确评估模型在复杂长上下文场景中的表现。为促进后续研究,我们同步公开了TLDM基准测试及其参考代码与数据集。


JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation

Abstract

arXiv:2505.14978v1 Announce Type: cross Abstract: This paper presents JARVIS, a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for specialized Electronic Design Automation (EDA) tasks. By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models. Our framework addresses the challenges of data scarcity and hallucination errors in LLMs, demonstrating the potential of LLMs in specialized engineering domains. We evaluate our framework on multiple benchmarks and show that it outperforms existing models in terms of accuracy and reliability. Our work sets a new precedent for the application of LLMs in EDA and paves the way for future innovations in this field.

摘要

本文提出JARVIS——一种新型多智能体框架,该框架通过结合大语言模型(LLM)与领域专业知识,为专用电子设计自动化(EDA)任务生成高质量脚本。我们的方法整合了基于合成数据训练的领域专用LLM、用于结构验证的自定义编译器、规则强制执行与代码修复功能以及高级检索机制,相比当前最先进的领域专用模型取得了显著改进。该框架有效解决了LLM在专业工程领域中面临的数据稀缺和幻觉错误等挑战,彰显了LLM在专业工程领域的应用潜力。我们在多个基准测试上评估了该框架,结果表明其在准确性和可靠性方面均优于现有模型。本研究为LLM在EDA领域的应用树立了新标杆,并为该领域的未来创新奠定了基础。


Programmatic Video Prediction Using Large Language Models

Abstract

arXiv:2505.14948v1 Announce Type: cross Abstract: The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a neuro-symbolic, human-interpretable set of states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.

摘要

估计描述现实世界过程动态的世界模型这一任务,对于预测和准备未来结果具有重要意义。在视频监控、机器人应用、自动驾驶等应用中,该目标需要在给定少量视频帧设定视觉上下文的情况下,合成合理的视觉未来。为此,我们提出了ProgGen,它通过利用大型(视觉)语言模型(LLM/VLM)的归纳偏置,将视频动态表示为一组神经符号化、人类可解释的状态(每帧一个状态),从而完成视频帧预测任务。具体而言,ProgGen利用LLM/VLM合成程序:(i)在给定视觉上下文(即帧)的情况下估计视频状态;(ii)通过估计过渡动态预测未来时间步对应的状态;(iii)将预测状态渲染为视觉RGB帧。实证评估表明,我们提出的方法在两个具有挑战性的环境(i)PhyWorld和(ii)Cart Pole中,在视频帧预测任务上优于竞争技术。此外,ProgGen支持反事实推理和可解释的视频生成,证明了其在视频生成任务中的有效性和泛化能力。


STree: Speculative Tree Decoding for Hybrid State-Space Models

Abstract

arXiv:2505.14969v1 Announce Type: cross Abstract: Speculative decoding is a technique to leverage hardware concurrency to improve the efficiency of large-scale autoregressive (AR) Transformer models by enabling multiple steps of token generation in a single forward pass. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead to current SSM state update implementations. With the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code will be released upon paper acceptance.

摘要

推测解码是一种利用硬件并发性提升自回归(AR)Transformer模型效率的技术,通过单次前向传播实现多步令牌生成。状态空间模型(SSMs)本身已比AR Transformer更高效,因其状态可汇总所有历史数据,无需缓存或重新处理滑动窗口上下文中的令牌。然而,其状态也可能包含数千个令牌,因此推测解码技术近期被扩展至SSMs领域。但现有方法未能利用基于树的验证机制,因当前SSMs缺乏高效计算令牌树的方法。我们提出首个可扩展算法,用于在状态空间模型(SSMs)及SSM与Transformer层的混合架构中实现基于树的推测解码。通过利用累积状态转移矩阵的结构,我们在现有SSM状态更新实现上以最小开销实现了基于树的推测解码。基于该算法,我们提出一种硬件感知的实现方案,改进了AR Transformer树基推测解码方法在SSMs中的直接应用。实验表明,在三个不同基准测试中,即使采用基线草稿模型和树结构,我们的方法仍优于SSMs的原始推测解码,为SSM及混合模型推理的进一步加速开辟了新途径。代码将在论文录用后公开。


Meta-Design Matters: A Self-Design Multi-Agent System

Abstract

arXiv:2505.14996v1 Announce Type: cross Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation-set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM back-bones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.

摘要

利用大型语言模型(LLM)强大能力的多智能体系统(MAS)在解决复杂任务方面具有重要潜力。然而,当前大多数MAS依赖于人工设计的智能体角色与通信协议。这些人工设计往往无法充分发挥底层LLM的优势,且难以适应新任务。近期自动化的MAS方法试图缓解这些限制,但通常需要验证集进行调优,并产生缺乏推理阶段适应性的静态MAS设计。我们提出SELF-MAS——首个仅需推理阶段的自监督自动化MAS设计框架。该方法通过元级设计迭代生成、评估并优化针对每个问题实例的MAS配置,无需验证集。其核心在于通过可解性与完备性的元反馈,实现动态的智能体组合与问题分解。在数学、研究生水平QA及软件工程基准测试上的实验表明(使用不同规模的闭源与开源LLM骨干模型),SELF-MAS在保持成本效益的同时,其性能优于人工与自动化MAS基线方法,平均准确率较次优基线提升7.44%。这些发现印证了元级自监督设计在构建高效自适应MAS方面的潜力。


Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

Abstract

arXiv:2505.14999v1 Announce Type: cross Abstract: Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi step logical consistency. While Chain of Thought (CoT) prompting elicits reasoning steps, it doesn't guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates where lower energy is assigned to solutions leading to correct final outcomes implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post hoc verification process.

摘要

数学推理对大型语言模型(LLM)构成重大挑战,通常需要强大的多步逻辑一致性。虽然思维链(CoT)提示能够引发推理步骤,但不能保证正确性,而通过大量采样提高可靠性又会导致计算成本高昂。本文提出能量结果奖励模型(EORM),一种高效、轻量级的事后验证器。EORM基于能量模型(EBM),通过学习仅使用结果标签为CoT解决方案分配标量能量分数,简化了奖励模型的训练,从而避免了详细标注。其实现方式是将判别器输出逻辑值解释为负能量,有效对候选方案进行排序——为导致正确最终结果的解决方案分配较低能量,隐式地偏好连贯推理。在数学基准测试(GSM8k、MATH)上,EORM显著提高了最终答案准确率(例如,使用Llama 3 8B模型时,在GSM8k上达到90.7%,在MATH上达到63.7%)。EORM能有效利用给定的候选解决方案池,达到或超越暴力采样的性能,从而通过其高效的事后验证流程提升LLM推理结果的可靠性。
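
The ranking rule described in the abstract (treat the discriminator's output logit as a negative energy, then prefer the lowest-energy candidate) can be sketched in a few lines. The function name, the candidate labels, and the logit values below are hypothetical placeholders; in practice the logits would come from a trained verifier, which is not shown here.

```python
from typing import List, Tuple


def rank_by_energy(candidates: List[str], logits: List[float]) -> List[Tuple[str, float]]:
    """Return (candidate, energy) pairs sorted from lowest to highest energy.

    Energy is defined as the negated logit, so a higher logit means a
    lower energy and therefore a better rank.
    """
    if len(candidates) != len(logits):
        raise ValueError("candidates and logits must have the same length")
    scored = [(cand, -logit) for cand, logit in zip(candidates, logits)]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical verifier logits for three sampled chain-of-thought solutions.
ranked = rank_by_energy(["solution A", "solution B", "solution C"], [0.3, 2.1, -1.0])
best = ranked[0][0]
print(best)
```

Because only a scalar per candidate is needed, this kind of post hoc selection adds one verifier pass per sampled solution and no extra generation.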


Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Abstract

arXiv:2505.15038v1 Announce Type: cross Abstract: Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noises (i.e., irrelevant features) that challenge steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.

摘要

线性概念向量已被证明能有效引导大语言模型(LLMs)。现有方法如线性探测和均值差异从LLM隐藏表示中推导这些向量,但多样化数据会引入噪声(即无关特征),影响引导的稳健性。为此,我们提出稀疏自编码去噪概念向量(SDCV),利用稀疏自编码器从隐藏表示中滤除噪声特征。当应用于线性探测和均值差异方法时,本方法显著提升了其引导成功率。我们通过反事实实验和特征可视化验证了噪声假设的合理性。
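
For readers unfamiliar with difference-in-means, one of the two baseline methods named in the abstract, the following sketch shows how such a concept vector is typically computed and applied. It covers only the baseline: the sparse-autoencoder denoising step that SDCV adds is not shown, and the toy two-dimensional hidden states, the `steer` helper, and the `alpha` strength parameter are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def difference_in_means(positive: Sequence[Sequence[float]],
                        negative: Sequence[Sequence[float]]) -> Vector:
    """Concept vector = mean(hidden states with concept) - mean(without)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden: Sequence[float], concept: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the concept direction with strength alpha."""
    return [h + alpha * c for h, c in zip(hidden, concept)]


# Toy hidden states (hypothetical): the concept lives on the first coordinate.
pos_states = [[1.0, 0.0], [3.0, 2.0]]
neg_states = [[0.0, 0.0], [0.0, 2.0]]
concept_vec = difference_in_means(pos_states, neg_states)
steered = steer([0.5, 0.5], concept_vec, alpha=2.0)
print(concept_vec, steered)
```

In this toy case the second coordinate varies within both groups but cancels in the difference; with real, diverse data such irrelevant features do not cancel exactly, which is the noise the abstract proposes to filter.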


One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks

Abstract

arXiv:2505.15009v1 Announce Type: cross Abstract: We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.

摘要

我们研究了一层Transformer模型在无噪声和有噪声上下文推理中进行下一词预测时的近似能力与收敛行为。现有理论成果主要关注对首次梯度步长或无限样本情况下的上下文推理行为的理解,且尚未涉及收敛速率或泛化能力的分析。本研究通过证明存在一类具有线性注意力机制和ReLU注意力机制的单层Transformer可被理论证实为贝叶斯最优,填补了这些空白。在梯度下降训练过程中,我们通过有限样本分析表明这些Transformer的期望损失以线性速率收敛至贝叶斯风险。此外,我们证明了训练后的模型不仅能泛化到未见样本,还展现出先前实证研究中观察到的学习行为。大量实证验证进一步支持了我们的理论发现。


Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Abstract

arXiv:2505.15075v1 Announce Type: cross Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

摘要

多模态大语言模型(MLLMs)的快速发展显著提升了其现实应用能力。然而,在跨语言场景下(尤其是涉及文化知识整合时)保持性能一致性仍存在重大挑战。为系统评估该问题,我们提出两个新基准:KnowRecall和VisRecall,用于评测MLLMs的跨语言一致性。KnowRecall作为视觉问答基准,通过15种语言测试全球地标的文化历史类问题,衡量事实知识一致性;VisRecall则要求模型在无图像条件下用9种语言描述地标外观,评估视觉记忆一致性。实验表明,包括商业模型在内的最先进MLLMs仍难以实现跨语言一致性,这凸显了需要开发更健壮的方法来构建真正多语言且具备文化认知能力的模型。


RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Abstract

arXiv:2505.15034v1 Announce Type: cross Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.

摘要

强化学习(RL)近期已成为提升大语言模型(LLM)推理能力的重要方法,其核心是通过验证器(奖励模型)引导作为策略的LLM生成器。然而,当前LLM的RL后训练方法通常采用固定验证器(基于规则或冻结预训练模型)或通过监督微调(SFT)训练的判别式验证器。此类设计易受奖励破解影响,且在训练分布之外泛化能力较差。为突破这些限制,我们提出Tango框架——一种通过RL交替训练LLM生成器与验证器的新方法。Tango的核心创新在于其生成式、过程级LLM验证器,该验证器通过RL训练并与生成器协同进化。值得注意的是,验证器仅基于结果级验证正确性奖励进行训练,无需显式过程级标注。与确定性或SFT训练的验证器相比,这种RL训练的生成式验证器展现出更强的鲁棒性和泛化能力,有效促进与生成器的双向增强。大量实验表明,Tango的双组件在7B/8B规模模型中均取得最先进成果:生成器在五项竞赛级数学基准和四项跨领域推理任务中达到同类最佳性能,验证器则在ProcessBench数据集上领先。值得注意的是,双组件在最高难度数学推理问题上均表现出显著提升。代码见:https://github.com/kaiwenzha/rl-tango。


ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding

Abstract

arXiv:2505.15046v1 Announce Type: cross Abstract: The emergence of Multi-modal Large Language Models (MLLMs) presents new opportunities for chart understanding. However, due to the fine-grained nature of these tasks, applying MLLMs typically requires large, high-quality datasets for task-specific fine-tuning, leading to high data collection and training costs. To address this, we propose ChartCards, a unified chart-metadata generation framework for multi-task chart understanding. ChartCards systematically synthesizes various chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions. By structuring this information into organized metadata, ChartCards enables a single chart to support multiple downstream tasks, such as text-to-chart retrieval, chart summarization, chart-to-table conversion, chart description, and chart question answering. Using ChartCards, we further construct MetaChart, a large-scale high-quality dataset containing 10,862 data tables, 85K charts, and 170K high-quality chart captions. We validate the dataset through qualitative crowdsourcing evaluations and quantitative fine-tuning experiments across various chart understanding tasks. Fine-tuning six different models on MetaChart resulted in an average performance improvement of 5% across all tasks. The most notable improvements are seen in text-to-chart retrieval and chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements of 17% and 28%, respectively.

摘要

多模态大语言模型(MLLMs)的出现为图表理解带来了新的机遇。然而,由于这类任务具有细粒度特性,应用MLLMs通常需要大规模高质量数据集进行任务特定微调,导致数据收集和训练成本高昂。为此,我们提出ChartCards——一个支持多任务图表理解的统一图表元数据生成框架。该框架系统化合成各类图表信息,包括数据表格、可视化代码、视觉元素以及多维度语义描述。通过将这些信息组织为结构化元数据,ChartCards使得单个图表可支持多种下游任务,例如文本到图表检索、图表摘要、图表转表格、图表描述和图表问答。基于ChartCards,我们进一步构建了MetaChart数据集,这个大规模高质量数据集包含10,862个数据表格、85K张图表和170K条优质图表描述。我们通过定性众包评估和跨多种图表理解任务的定量微调实验验证了数据集质量。在MetaChart上微调的六种不同模型,所有任务平均性能提升达5%。其中文本到图表检索和图表转表格任务提升最为显著,Long-CLIP和Llama 3.2-11B模型分别实现了17%和28%的性能提升。


PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration

Abstract

arXiv:2505.15047v1 Announce Type: cross Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering systematic uncertainty reduction. Overcoming these limitations fundamentally requires systematic uncertainty reduction. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains -- discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties -- our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06% compared to a vanilla agent system. Overall, PiFlow serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available on GitHub at https://github.com/amair-lab/PiFlow.

摘要

基于大语言模型(LLM)的多智能体系统(MAS)在科学发现领域展现出显著潜力。然而,现有方法通常采用缺乏合理性约束的预定义工作流来实现科学发现自动化,这往往导致假设生成漫无目的,且无法持续将假设与证据相关联,从而阻碍系统性不确定性的降低。克服这些局限性的核心在于实现系统化的不确定性消减。我们提出PiFlow,一个信息理论框架,将自动化科学发现视为受科学定律等原则指导的结构化不确定性消减问题。在三个不同科学领域(具有目标特性的纳米材料结构发现、生物分子发现和超导体候选材料发现)的评估中,本方法显著提升了发现效率(属性值与探索步骤的曲线下面积AUC提升73.55%),并将解决方案质量较基线智能体系统提高94.06%。总体而言,PiFlow作为一种即插即用方法,建立了高效自动化科学发现的新范式,为更稳健、更快速的人工智能驱动研究铺平了道路。代码已在GitHub公开:https://github.com/amair-lab/PiFlow。


DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Abstract

arXiv:2505.15074v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups - assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.

摘要

大型语言模型(LLMs)正通过基于人类反馈的强化学习(RLHF)日益与人类偏好对齐。在众多RLHF方法中,组相对策略优化(GRPO)因其简洁性和卓越性能备受关注,其显著特点是无需学习价值函数。然而,GRPO隐含假设了均衡的领域分布和跨组语义对齐的一致性——这种假设在现实数据集中几乎无法成立。当应用于多领域不平衡数据时,GRPO会过度优化主导领域,忽视弱势领域,导致泛化能力与公平性下降。我们提出领域感知自洽策略优化(DISCO),这是GRPO的原则性扩展,通过两项关键创新解决组间不平衡问题:基于领域频率的奖励缩放通过领域流行度重加权优化来抵消频率偏差;难度感知奖励缩放利用提示级自洽性识别并优先处理具有更高学习价值的不确定性提示。这两种策略共同促进了跨领域更公平有效的策略学习。在多个LLM和倾斜训练分布上的大量实验表明,DISCO显著提升泛化能力,在Qwen3模型上以5%优势超越现有GRPO变体,并在多领域对齐基准测试中创造了新的最优性能记录。
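
The abstract does not give DISCO's formulas, so the following is only a rough sketch of how a domain-aware and a difficulty-aware scale could be combined with a GRPO-style group-relative advantage. The within-group standardization is the usual GRPO form; the inverse-frequency `domain_weight`, the `difficulty_weight` based on one minus self-consistency, and the choice to apply the scale after standardization are all assumptions made here for illustration, not the authors' definitions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def domain_weight(domain_count: int, total_count: int) -> float:
    """Hypothetical inverse-frequency weight: rarer domains get larger weights."""
    return total_count / domain_count


def difficulty_weight(self_consistency: float) -> float:
    """Hypothetical weight that grows as sampled answers agree less (input in [0, 1])."""
    return 1.0 + (1.0 - self_consistency)


def scaled_advantages(rewards: List[float], domain_count: int,
                      total_count: int, self_consistency: float) -> List[float]:
    """Apply both scales after standardization (a scale applied before would cancel)."""
    scale = domain_weight(domain_count, total_count) * difficulty_weight(self_consistency)
    return [scale * a for a in group_relative_advantages(rewards)]


# A prompt from a domain covering 10 of 100 training prompts, with 50% answer agreement.
example = scaled_advantages([1.0, 0.0, 0.0, 1.0], domain_count=10,
                            total_count=100, self_consistency=0.5)
print(example)
```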


Abstract

arXiv:2505.15088v1 Announce Type: cross Abstract: Command injection vulnerabilities are a significant security threat in dynamic languages like Python, particularly in widely used open-source projects where security issues can have extensive impact. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, such as testing, researchers have explored their potential for vulnerability analysis. This study evaluates the potential of LLMs, such as GPT-4, as an alternative approach for automated testing for vulnerability detection. In particular, LLMs have demonstrated advanced contextual understanding and adaptability, making them promising candidates for identifying nuanced security vulnerabilities within code. To evaluate this potential, we applied LLM-based analysis to six high-profile GitHub projects (Django, Flask, TensorFlow, Scikit-learn, PyTorch, and Langchain), each with over 50,000 stars and extensive adoption across software development and academic research. Our analysis assesses both the strengths and limitations of LLMs in detecting command injection vulnerabilities, evaluating factors such as detection accuracy, efficiency, and practical integration into development workflows. In addition, we provide a comparative analysis of different LLM tools to identify those most suitable for security applications. Our findings offer guidance for developers and security researchers on leveraging LLMs as innovative and automated approaches to enhance software security.

摘要

命令注入漏洞是Python等动态语言中的重大安全威胁,尤其在广泛使用的开源项目中,此类安全问题可能产生深远影响。随着大语言模型(LLMs)在代码相关任务(如测试)中有效性得到验证,研究者开始探索其在漏洞分析中的潜力。本研究评估了GPT-4等大语言模型作为自动化漏洞检测替代方案的可行性。LLMs展现出先进的上下文理解能力和适应性,使其成为识别代码中复杂安全漏洞的有力候选。为验证这一潜力,我们对六个知名GitHub项目(Django、Flask、TensorFlow、Scikit-learn、PyTorch和Langchain)进行了基于LLM的分析,这些项目均拥有超过5万星标并在软件开发和学术研究中广泛应用。我们的分析评估了LLMs在检测命令注入漏洞时的优势与局限,包括检测准确性、效率以及与开发流程的实际整合度。此外,我们通过对比不同LLM工具,筛选出最适合安全应用的模型。研究结果为开发者和安全研究人员提供了利用LLMs作为创新自动化方案来增强软件安全性的实践指导。


Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning

Abstract

arXiv:2505.15062v1 Announce Type: cross Abstract: When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate "hormones helping mental disorders" with "melatonin being a hormone and insomnia a mental disorder" to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE's key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% → 71.4% and 78.6% → 90.5% on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.

摘要

在处理需要新信息的复杂问题时,人们常通过将问题与既有知识关联来推导合理答案。例如评估褪黑激素是否改善失眠时,可能将"激素有助于精神障碍"与"褪黑激素是激素且失眠属于精神障碍"相关联来完成推理。大语言模型(LLMs)同样需要这种关联思维,尤其在检索知识不足且无法直接回答问题时的科学查询场景。图启发真实性外推法(GIVE)通过知识图谱(KG)实现结构化知识外推,但涉及大量假设三元组的构建与剪枝,制约了效率与泛化能力。我们提出Self-GIVE检索强化学习框架,通过强化学习使LLMs具备自动关联思维能力。该方法提取结构化信息与实体集以辅助模型连接查询概念,解决了GIVE的三个关键局限:(1)知识外推需要大量LLM调用与token开销;(2)复杂指令导致难以部署于小型LLMs(3B/7B);(3)LLM剪枝产生的知识不准确。具体而言,在使用135节点UMLS KG进行Self-GIVE微调后,Qwen2.5的3B与7B模型在生物医学QA难题未见样本中的表现最高分别实现28.5%→71.4%和78.6%→90.5%的提升。特别地,Self-GIVE使7B模型达到或超越GPT3.5 turbo搭配GIVE的表现,同时减少90%以上的token消耗。该方法增强了结构化检索与关联思维推理的可扩展集成。


StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

Abstract

arXiv:2505.15107v1 Announce Type: cross Abstract: Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because their rewards are sparse and come only from a global signal. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It uses richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, it significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. Our implementation is publicly available at https://github.com/zxh20001117/StepSearch.

摘要

高效的多跳推理要求基于大语言模型(LLM)的智能体通过迭代获取高价值外部知识。先前研究探索了利用强化学习(RL)训练LLM执行基于搜索的文档检索,在问答性能上取得显著提升,但在仅依赖全局稀疏奖励信号的复杂多跳问答任务中表现欠佳。为填补这一研究空白,我们提出StepSearch框架——采用逐步近端策略优化方法训练的搜索型LLM系统。该框架包含基于信息增益与冗余惩罚的更丰富中间搜索奖励机制,以及细粒度的词级过程监督,以更好地指导每个搜索步骤。通过数据管道方法,我们在开源数据集基础上构建了包含子问题级搜索轨迹的细粒度问答数据集。在标准多跳问答基准测试中,该方法显著优于全局奖励基线模型:3B和7B参数模型仅使用19k训练数据,就在各类RL搜索基线上分别实现11.2%和4.2%的绝对性能提升,证明了细粒度逐步监督对深度搜索LLM优化的有效性。项目代码已开源:https://github.com/zxh20001117/StepSearch。


ThinkRec: Thinking-based recommendation via LLM

Abstract

arXiv:2505.15091v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled more semantic-aware recommendations through natural language generation. Existing LLM for recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on superficial features to match similar items based on click history, rather than reasoning through deeper behavioral logic. This often leads to superficial and erroneous recommendations. Motivated by this, we propose ThinkRec, a thinking-based framework that shifts LLM4Rec from System 1 to System 2 (rational system). Technically, ThinkRec introduces a thinking activation mechanism that augments item metadata with keyword summarization and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains that consist of analyzing interaction histories, identifying user preferences, and making decisions based on target items. On top of this, we propose an instance-wise expert fusion mechanism to reduce the reasoning difficulty. By dynamically assigning weights to expert models based on users' latent features, ThinkRec adapts its reasoning path to individual users, thereby enhancing precision and personalization. Extensive experiments on real-world datasets demonstrate that ThinkRec significantly improves the accuracy and interpretability of recommendations. Our implementations are available in anonymous Github: https://anonymous.4open.science/r/ThinkRec_LLM.

摘要

大语言模型(LLM)的最新进展通过自然语言生成实现了更具语义感知的推荐。现有基于LLM的推荐方法(LLM4Rec)大多以类似系统1的方式运作,依赖表层特征根据点击历史匹配相似项目,而非通过更深层的行为逻辑进行推理,这往往导致推荐结果流于表面且存在错误。受此启发,我们提出ThinkRec——一个基于思考的框架,将LLM4Rec从系统1转向系统2(理性系统)。技术上,ThinkRec引入了思考激活机制,通过关键词摘要增强项目元数据并注入合成推理轨迹,引导模型形成可解释的推理链,包括分析交互历史、识别用户偏好以及基于目标项目做出决策。在此基础上,我们提出实例级专家融合机制以降低推理难度。通过根据用户潜在特征动态分配专家模型权重,ThinkRec能针对个体用户调整推理路径,从而提升推荐的精确性与个性化。在真实数据集上的大量实验表明,ThinkRec显著提高了推荐的准确性和可解释性。实现代码已发布于匿名GitHub:https://anonymous.4open.science/r/ThinkRec_LLM。


DeFTX: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer

Abstract

arXiv:2505.15090v1 Announce Type: cross Abstract: Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer that learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.

摘要

实现有效的跨语言迁移仍是扩大大型语言模型从高资源语言向低资源语言应用效益的关键挑战。为此,先前研究探索了多种方法,旨在结合(高资源)源语言中任务特定数据所蕴含的任务知识,以及(低资源)目标语言中未标注文本所包含的语言知识。其中一种显著方法是提出可组合稀疏微调(SFT)技术,该方法通过学习任务特定和语言特定的稀疏掩码,从预训练模型参数中选择子集进行进一步微调。这些稀疏微调向量(SFTs)随后与预训练模型组合,仅需源语言的任务特定数据即可促进目标语言任务的零样本跨语言迁移。原始SFT稀疏掩码通过简单的基于幅度的剪枝方法确定。本研究中,我们提出DeFT-X——一种新颖的可组合SFT方法,该方法在幅度剪枝前利用奇异值分解对预训练模型权重矩阵进行去噪,从而生成更具鲁棒性的SFTs。我们在情感分类(NusaX)和自然语言推理(AmericasNLI)任务中针对多种极低资源语言评估DeFT-X,结果表明其性能与SFT相当或优于SFT及其他主流跨语言迁移基线方法。


SUS backprop: linear backpropagation algorithm for long inputs in transformers

Abstract

arXiv:2505.15080v1 Announce Type: cross Abstract: It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can potentially save a significant amount of back-propagation computation in exchange for a minimal increase in the stochastic gradient variance, in some situations. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length $n$. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule controlled by a single parameter $c$ that cuts backpropagation through most attention weights, leaving at most $c$ interactions per token per attention head. This brings a factor of $c/n$ reduction in the compute required for the attention backpropagation, turning it from quadratic $O(n^2)$ to linear complexity $O(nc)$. We have empirically verified that, for a typical transformer model, cutting 99% of the attention gradient flow (i.e. choosing $c \sim 20$-$30$) results in relative gradient variance increase of only about 1% for $n \sim 2000$, and it decreases with $n$. This approach is amenable to efficient sparse matrix implementation, thus being promising for making the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.

摘要

设计一种无偏梯度估计器来随机切断计算图中任意部分的反向传播路径是直接可行的。通过切断对计算影响较小的部分,在某些情况下可以显著节省反向传播计算量,仅需付出随机梯度方差的微小增加。这种情形出现在Transformer架构的注意力机制中。对于长序列,注意力成为限制因素,其计算需求随序列长度n呈二次方增长。与此同时,大多数注意力权重变得非常小,因为多数注意力头通常只将给定标记与序列中一小部分其他标记相连。这些权重成为切断反向传播的理想目标。我们提出一个由单一参数c控制的简单概率规则,该规则可切断大多数注意力权重的反向传播,使每个注意力头中每个标记最多保留c个交互。这将注意力反向传播的计算需求降低c/n倍,使其从二次复杂度O(n²)降至线性复杂度O(nc)。实验验证表明,对于典型Transformer模型,当n≈2000时切断99%的注意力梯度流(即选择c≈20-30)仅导致约1%的相对梯度方差增长,且该增长随n增大而减小。这种方法适合高效的稀疏矩阵实现,有望使长序列训练时反向传播的计算成本相对于前向传播变得微不足道。
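
To make the idea of an unbiased stochastic cut concrete, here is a minimal sketch. The abstract does not state the paper's exact rule, so the keep probability $p_j = \min(1, c \cdot a_j)$ for attention weight $a_j$ is an assumption chosen for illustration, as are the function names. Surviving gradient entries are rescaled by $1/p_j$, which keeps the expectation unchanged. Because each row of attention weights sums to 1, the expected number of surviving entries per row is at most $c$.

```python
import random

def keep_probs(attn_row, c):
    """Assumed rule: keep entry j with probability min(1, c * a_j)."""
    return [min(1.0, c * a) for a in attn_row]

def sparsify_grad(grad_row, attn_row, c, rng):
    """Zero most gradient entries and rescale survivors by 1/p,
    so the expected value of each entry is unchanged."""
    out = []
    for g, p in zip(grad_row, keep_probs(attn_row, c)):
        if p > 0.0 and rng.random() < p:
            out.append(g / p)
        else:
            out.append(0.0)
    return out

# Toy attention row for one token (sums to 1) and a toy upstream gradient.
attn = [0.5, 0.3, 0.1, 0.1]
grad = [1.0, 1.0, 1.0, 1.0]
sample = sparsify_grad(grad, attn, c=2, rng=random.Random(0))
```

Entries with large attention weights are always kept, while small ones are usually dropped and occasionally kept with a large rescaling factor. That rescaling is where the extra gradient variance comes from.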


A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents

Abstract

arXiv:2505.15108v1 Announce Type: cross Abstract: The proliferation of Large Language Models (LLMs) and Intelligent Virtual Agents acting as psychotherapists presents significant opportunities for expanding mental healthcare access. However, their deployment has also been linked to serious adverse outcomes, including user harm and suicide, facilitated by a lack of standardized evaluation methodologies capable of capturing the nuanced risks of therapeutic interaction. Current evaluation techniques lack the sensitivity to detect subtle changes in patient cognition and behavior during therapy sessions that may lead to subsequent decompensation. We introduce a novel risk taxonomy specifically designed for the systematic evaluation of conversational AI psychotherapists. Developed through an iterative process including review of the psychotherapy risk literature, qualitative interviews with clinical and legal experts, and alignment with established clinical criteria (e.g., DSM-5) and existing assessment tools (e.g., NEQ, UE-ATR), the taxonomy aims to provide a structured approach to identifying and assessing user/patient harms. We provide a high-level overview of this taxonomy, detailing its grounding, and discuss potential use cases. We discuss two use cases in detail: monitoring cognitive model-based risk factors during a counseling conversation to detect unsafe deviations, in both human-AI counseling sessions and in automated benchmarking of AI psychotherapists with simulated patients. The proposed taxonomy offers a foundational step towards establishing safer and more responsible innovation in the domain of AI-driven mental health support.

摘要

大型语言模型(LLMs)和作为心理治疗师的智能虚拟代理的激增,为扩大心理健康服务的可及性提供了重要机遇。然而,由于缺乏能够捕捉治疗互动中微妙风险的标准化评估方法,其部署也导致了包括用户伤害和自杀在内的严重不良后果。现有评估技术对治疗过程中患者认知和行为的细微变化(这些变化可能导致后续失代偿)缺乏敏感性。我们提出了一种专门用于系统评估对话式AI心理治疗师的新型风险分类法。该分类法通过迭代过程开发而成,包括对心理治疗风险文献的回顾、与临床和法律专家的定性访谈,以及与既定临床标准(如DSM-5)和现有评估工具(如NEQ、UE-ATR)的对齐,旨在为识别和评估用户/患者伤害提供结构化方法。我们对该分类法进行了高层概述,详细说明了其基础,并讨论了潜在应用场景。我们重点探讨了两个应用场景:在人类-AI咨询会话中监测基于认知模型的风险因素以检测不安全偏差,以及通过模拟患者对AI心理治疗师进行自动化基准测试。所提出的分类法为在AI驱动的心理健康支持领域建立更安全、更负责任的创新迈出了基础性一步。


An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

Abstract

arXiv:2505.15117v1 Announce Type: cross Abstract: Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.

摘要

强化学习(RL)在训练具备复杂推理能力的大型语言模型(LLMs)以解决现实问题方面展现出巨大潜力。近期,RL被进一步用于构建基于LLM的高级搜索代理,这些代理能巧妙地将推理与搜索引擎使用相结合。尽管利用RL训练搜索代理前景广阔,但其最优设计仍未完全明晰。尤其关键因素——包括(1)奖励机制设计,(2)底层LLM的选择与特性,以及(3)搜索引擎在RL过程中的作用——亟需深入探究。本研究通过全面实证分析系统考察了这些因素,并提出可操作的见解。我们揭示了若干重要发现:格式化奖励能有效提升最终性能,而中间检索奖励影响有限;LLM的规模与初始化方式(通用型与专用推理型)显著影响RL效果;搜索引擎的选择对RL训练动态及代理在推理阶段的鲁棒性具有关键作用。这些发现为实际应用中成功构建和部署基于LLM的搜索代理确立了重要准则。代码详见https://github.com/PeterGriffinJin/Search-R1。


Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning

Abstract

arXiv:2505.15154v1 Announce Type: cross Abstract: Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and produce unnecessarily long outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model's perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.

摘要

推理技术的最新进展显著提升了大型语言模型(LLMs)和多模态大型语言模型(MLLMs)在各类任务中的表现。然而,过度依赖思维链(CoT)推理会损害模型性能,并导致输出冗长,降低效率。我们的研究表明,延长推理过程并不能普遍提高准确性,甚至在简单任务上会降低性能。为此,我们提出基于确定性的自适应推理(CAR),这是一种新颖的框架,能够根据模型的困惑度动态切换简短答案和长式推理。CAR首先生成一个简短答案并评估其困惑度,仅当模型表现出低置信度(即高困惑度)时才会触发推理。在多种多模态VQA/KIE基准测试和文本推理数据集上的实验表明,CAR在准确性和效率之间实现了最佳平衡,其表现优于简短答案和长式推理方法。
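
The routing decision described above can be sketched in a few lines. Perplexity here is the standard quantity, the exponential of the mean negative log-probability of the generated tokens. The fixed `threshold` and the function names are assumptions for illustration; the abstract does not say how CAR calibrates its decision rule.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability of the generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def route(short_answer_logprobs, threshold):
    """Return 'short' when the short answer looks confident, else 'reason'."""
    if perplexity(short_answer_logprobs) > threshold:
        return "reason"  # low confidence: fall back to long-form reasoning
    return "short"       # high confidence: keep the short answer

# Hypothetical per-token log-probabilities of two short answers.
confident = [math.log(0.9)] * 3
uncertain = [math.log(0.2)] * 3
decisions = (route(confident, threshold=1.5), route(uncertain, threshold=1.5))
```

In practice the log-probabilities would come from the model that produced the short answer, so the check adds no extra generation cost before the routing decision.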


The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Abstract

arXiv:2505.15134v1 Announce Type: cross Abstract: Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.

摘要

熵最小化(EM)通过训练模型使其在最具置信度的输出上进一步集中概率质量。本研究表明,仅凭这一简单目标(无需任何标注数据)即可显著提升大语言模型(LLM)在数学、物理和编程等挑战性任务上的表现。我们探索了三种方法:(1)EM-FT采用类似指令微调的方式最小化token级熵,但处理对象为模型自身生成的未标注输出;(2)EM-RL将负熵作为唯一奖励信号进行强化学习优化;(3)EM-INF通过推理阶段的对数调整降低熵值,无需训练数据或参数更新。在Qwen-7B上的实验表明,EM-RL在不使用任何标注数据的情况下,其性能可媲美甚至超越GRPO、RLOO等基于6万标注样本训练的强强化学习基线。此外,EM-INF使Qwen-32B在SciCode基准测试中达到或超越GPT-4o、Claude 3 Opus和Gemini 1.5 Pro等专有模型的表现,同时效率比自洽性和序列优化方法提升3倍。这些发现揭示:许多预训练LLM具有此前未被充分认识的推理能力,仅通过熵最小化(无需标注数据或参数更新)即可有效激发。
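
The quantity being minimized is the Shannon entropy of the model's next-token distribution. The sketch below computes it and shows that sharpening the logits lowers it. Temperature scaling is used only as a simple stand-in: the abstract does not specify how EM-INF adjusts logits, so this is an assumption, not the paper's procedure.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature < 1 sharpens the distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.0]                      # hypothetical next-token logits
h_before = entropy(softmax(logits))           # original distribution
h_after = entropy(softmax(logits, 0.5))       # sharpened distribution
```

Here `h_after` is smaller than `h_before`: more probability mass sits on the already most likely token, which is the behavior the EM objectives reward.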


BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Abstract

arXiv:2505.15141v1 Announce Type: cross Abstract: Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.

摘要

推测解码已成为加速大语言模型(LLM)推理同时保持其优异文本生成性能的流行方法。现有方法要么采用固定配置的推测解码(忽略前缀标记),要么通过离线或在线训练草稿模型以使其与上下文对齐。本文提出一种免训练的在线学习框架,能在文本生成过程中自适应选择推测解码的超参数配置。我们首先将该超参数选择问题建模为多臂老虎机问题,并提出通用推测解码框架BanditSpec。进一步设计了两种基于老虎机的超参数选择算法UCBSpec和EXP3Spec,并通过新定义的停止时间遗憾量进行分析。我们在随机奖励和对抗奖励两种设置下给出了该遗憾的上界。通过推导信息论不可能性结果,证明UCBSpec的遗憾性能在通用常数范围内达到最优。最后,基于LLaMA3和Qwen2的大量实验表明,相较于现有方法,我们的算法具有显著优势,在模拟真实LLM服务场景(包含多样化输入提示)中,其吞吐量接近最优超参数配置的预言机性能。
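
Treating each candidate speculative-decoding configuration as an arm, a bandit picks one per round and learns from the observed reward. The sketch below is textbook UCB1, not the UCBSpec algorithm analyzed in the paper. The reward (for example, accepted tokens per step normalized to [0, 1]) and the three fixed reward values are assumptions for illustration.

```python
import math

class UCB1:
    """Standard UCB1 over a finite set of arms (candidate configurations)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self):
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm  # play every arm once before using the bound
        t = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a] + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Hypothetical mean rewards for three configurations.
rewards = [0.2, 0.8, 0.5]
bandit = UCB1(len(rewards))
for _ in range(1000):
    arm = bandit.select()
    bandit.update(arm, rewards[arm])
```

After the loop, the configuration with the highest reward has been selected far more often than the others, while the weaker ones were still sampled enough to rule them out.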


ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

Abstract

arXiv:2505.15182v1 Announce Type: cross Abstract: Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent's actual state and goal. Our analysis finds that this stems from ReAct's inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent's state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.

摘要

大语言模型(LLM)智能体的最新进展主要建立在ReAct等推理框架上,该框架在复杂环境中交替执行思考与行动。然而,ReAct常产生缺乏依据或逻辑混乱的推理步骤,导致智能体实际状态与目标间出现偏差。我们的分析发现,这源于ReAct无法保持一致的内部信念与目标对齐,从而引发错误累积与幻觉。为此,我们提出ReflAct——一种新型框架,其推理过程从单纯规划下一步行动转向持续反思智能体当前状态与目标的匹配度。通过将决策显式地锚定于状态数据并强化持续目标对齐,ReflAct显著提升了策略可靠性。该设计带来实质性的实证提升:在ALFWorld环境中,ReflAct平均超越ReAct达27.7%,实现93.3%的成功率。值得注意的是,即便ReAct附加增强模块(如Reflexion、WKM),ReflAct仍保持优势,这表明强化核心推理框架才是提升智能体性能可靠性的关键。


AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection

Abstract

arXiv:2505.15173v1 Announce Type: cross Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, particularly in video generation, has led to unprecedented creative capabilities but also increased threats to information integrity, identity security, and public trust. Existing detection methods, while effective in general scenarios, lack robust solutions for human-centric videos, which pose greater risks due to their realism and potential for legal and ethical misuse. Moreover, current detection approaches often suffer from poor generalization, limited scalability, and reliance on labor-intensive supervised fine-tuning. To address these challenges, we propose AvatarShield, the first interpretable MLLM-based framework for detecting human-centric fake videos, enhanced via Group Relative Policy Optimization (GRPO). Through our carefully designed accuracy detection reward and temporal compensation reward, it effectively avoids the use of high-cost text annotation data, enabling precise temporal modeling and forgery detection. Meanwhile, we design a dual-encoder architecture, combining high-level semantic reasoning and low-level artifact amplification to guide MLLMs in effective forgery detection. We further collect FakeHumanVid, a large-scale human-centric video benchmark that includes synthesis methods guided by pose, audio, and text inputs, enabling rigorous evaluation of detection methods in real-world scenes. Extensive experiments show that AvatarShield significantly outperforms existing approaches in both in-domain and cross-domain detection, setting a new standard for human-centric video forensics.

摘要

人工智能生成内容(AIGC)技术的快速发展,特别是在视频生成领域,带来了前所未有的创作能力,同时也对信息完整性、身份安全和公众信任构成了日益严重的威胁。现有检测方法虽然在通用场景中有效,但缺乏针对以人物为中心视频的鲁棒解决方案——这类视频因其高度逼真性及潜在的伦理法律滥用风险而危害更大。此外,当前检测方法普遍存在泛化能力差、可扩展性有限以及依赖劳动密集型监督微调等问题。为解决这些挑战,我们提出AvatarShield框架,这是首个基于多模态大语言模型(MLLM)的可解释性人物虚假视频检测系统,通过群体相对策略优化(GRPO)进行增强。借助精心设计的精度检测奖励与时序补偿奖励机制,该框架有效避免了高成本文本标注数据的使用,实现了精确的时序建模与伪造检测。同时,我们设计了双编码器架构,结合高层语义推理与低层伪影放大技术,引导MLLM进行有效伪造检测。我们还构建了FakeHumanVid数据集——一个大规模以人物为中心的视频基准,包含姿势引导、音频引导和文本引导等多种合成方法,为真实场景下的检测方法评估提供严格标准。大量实验表明,AvatarShield在域内检测与跨域检测中均显著优于现有方法,为人物视频取证树立了新标杆。


R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

Abstract

arXiv:2505.15155v1 Announce Type: cross Abstract: Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.

摘要

金融市场因其高维度、非平稳性和持续波动性,对资产回报预测提出了根本性挑战。尽管大型语言模型和多智能体系统取得了进展,当前量化研究流程仍存在自动化程度有限、可解释性薄弱以及因子挖掘与模型创新等关键环节协同不足的问题。本文提出量化金融研发智能体RD-Agent(Q),这是首个以数据为中心的多智能体框架,通过协调因子-模型联合优化实现量化策略全栈研发的自动化。该框架将量化流程分解为两个迭代阶段:研究阶段动态设定目标对齐提示,基于领域先验形成假设并将其映射为具体任务;开发阶段采用代码生成智能体Co-STEER实现任务专属代码,并在真实市场中进行回测执行。两阶段通过反馈环节连接,该环节全面评估实验结果并指导后续迭代,配合多臂老虎机调度器实现自适应方向选择。实证表明,RD-Agent(Q)在使用因子数量减少70%的情况下,年化收益最高可达传统因子库的2倍,并在真实市场中优于最先进的深度时间序列模型。其因子-模型联合优化机制在预测准确性与策略稳健性之间实现了良好平衡。代码已开源:https://github.com/microsoft/RD-Agent。


Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework

Abstract

arXiv:2505.15245v1 Announce Type: cross Abstract: While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance and generalizes strongly. Our dataset and code are available at https://github.com/carryTatum/GETER.

摘要

虽然大语言模型(LLM)在时序推理方面展现出巨大潜力,但现有研究大多集中于性能提升,往往忽视了对结果背后可解释推理过程的探究。为填补这一空白,我们提出了一个覆盖多粒度时序场景的综合性基准,旨在系统评估LLM在可解释时序推理方面的能力。研究发现,仅依赖文本信息时,LLM难以提供令人信服的解释。针对这一挑战,我们提出了GETER——一种新颖的结构感知生成框架,通过将图结构与文本相结合来实现可解释时序推理。具体而言,我们首先利用时序知识图谱构建时序编码器以捕获查询的结构化信息;随后引入结构-文本前缀适配器,将图结构特征映射至文本嵌入空间;最终LLM通过无缝融合软图标记与指令调优提示标记来生成解释文本。实验结果表明,GETER不仅实现了最先进的性能,同时展现出卓越的有效性与强泛化能力。我们的数据集与代码已开源:https://github.com/carryTatum/GETER。


Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs

Abstract

arXiv:2505.15265v1 Announce Type: cross Abstract: Adversarial attacks aim to generate malicious inputs that mislead deep models, but beyond causing model failure, they cannot provide certain interpretable information such as "What content in inputs makes models more likely to fail?" However, this information is crucial for researchers to specifically improve model robustness. Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs (such as "wet" and "foggy"), making them prone to errors. Inspired by this, in this paper we conducted the first exploration on large vision-language models (LVLMs) and found that LVLMs indeed are susceptible to hallucinations and various errors when facing specific semantic concepts in images. To efficiently search for these sensitive concepts, we integrated large language models (LLMs) and text-to-image (T2I) models to propose a novel semantic evolution framework. Randomly initialized semantic concepts undergo LLM-based crossover and mutation operations to form image descriptions, which are then converted by T2I models into visual inputs for LVLMs. The task-specific performance of LVLMs on each input is quantified as fitness scores for the involved semantics and serves as reward signals to further guide LLMs in exploring concepts that induce LVLM failures. Extensive experiments on seven mainstream LVLMs and two multimodal tasks demonstrate the effectiveness of our method. Additionally, we provide interesting findings about the sensitive semantics of LVLMs, aiming to inspire further in-depth research.

摘要

对抗攻击旨在生成误导深度模型的恶意输入,但除了导致模型失效外,其无法提供诸如“输入中哪些内容更易导致模型失败?”等可解释信息。然而这些信息对研究者针对性提升模型鲁棒性至关重要。最新研究表明,视觉输入中的特定语义(如“潮湿”“雾天”)可能使模型产生异常敏感并诱发错误。受此启发,本文针对大视觉语言模型(LVLM)展开首次探索,发现当图像包含特定语义概念时,LVLM确实易产生幻觉与各类错误。为高效搜寻这些敏感概念,我们融合大语言模型(LLM)与文生图(T2I)模型,提出新型语义演化框架:随机初始化的语义概念经过基于LLM的交叉变异操作形成图像描述,再由T2I模型转换为视觉输入供LVLM处理。LVLM在各输入上的任务表现被量化为相关语义的适应度分数,并作为奖励信号进一步引导LLM探索诱发LVLM失效的概念。在七种主流LVLM和两项多模态任务上的大量实验验证了本方法的有效性。此外,我们揭示了关于LVLM敏感语义的有趣发现,以期启发更深入的后续研究。


Adaptive Plan-Execute Framework for Smart Contract Security Auditing

Abstract

arXiv:2505.15242v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown great promise in code analysis and auditing; however, they still struggle with hallucinations and limited context-aware reasoning. We introduce SmartAuditFlow, a novel Plan-Execute framework that enhances smart contract security analysis through dynamic audit planning and structured execution. Unlike conventional LLM-based auditing approaches that follow fixed workflows and predefined steps, SmartAuditFlow dynamically generates and refines audit plans based on the unique characteristics of each smart contract. It continuously adjusts its auditing strategy in response to intermediate LLM outputs and newly detected vulnerabilities, ensuring a more adaptive and precise security assessment. The framework then executes these plans step by step, applying a structured reasoning process to enhance vulnerability detection accuracy while minimizing hallucinations and false positives. To further improve audit precision, SmartAuditFlow integrates iterative prompt optimization and external knowledge sources, such as static analysis tools and Retrieval-Augmented Generation (RAG). This ensures audit decisions are contextually informed and backed by real-world security knowledge, producing comprehensive security reports. Extensive evaluations across multiple benchmarks demonstrate that SmartAuditFlow outperforms existing methods, achieving 100 percent accuracy on common and critical vulnerabilities, 41.2 percent accuracy for comprehensive coverage of known smart contract weaknesses in real-world projects, and successfully identifying all 13 tested CVEs. These results highlight SmartAuditFlow's scalability, cost-effectiveness, and superior adaptability over traditional static analysis tools and contemporary LLM-based approaches, establishing it as a robust solution for automated smart contract auditing.

摘要

大语言模型(LLMs)在代码分析与审计中展现出巨大潜力,但仍存在幻觉问题和上下文感知推理能力不足的局限。本文提出SmartAuditFlow——一种通过动态审计规划与结构化执行来增强智能合约安全分析的新型"计划-执行"框架。与基于固定工作流和预定义步骤的传统LLM审计方法不同,SmartAuditFlow能根据每个智能合约的独特特征动态生成并优化审计方案。该框架会持续根据LLM的中间输出与新发现的漏洞调整审计策略,确保安全评估更具适应性和精确性。随后通过结构化推理流程逐步执行审计方案,在提升漏洞检测准确率的同时最大限度减少幻觉误报。为进一步提高审计精度,SmartAuditFlow整合了迭代式提示优化与外部知识源(如静态分析工具和检索增强生成技术),确保审计决策具有上下文依据并符合现实安全知识,最终生成全面安全报告。跨多基准的广泛实验表明,SmartAuditFlow在常见关键漏洞检测中达到100%准确率,对真实项目已知智能合约弱点的综合覆盖准确率达41.2%,并能成功识别全部13个测试CVE,其性能显著优于现有方法。这些结果证明该框架在可扩展性、成本效益和适应性方面均优于传统静态分析工具与当代基于LLM的方案,为自动化智能合约审计提供了可靠解决方案。


Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding

Abstract

arXiv:2505.15380v1 Announce Type: cross Abstract: Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality of the target model while accelerating inference.

摘要

现代基于语言模型的自回归语音合成模型已展现出卓越性能。然而,这些模型采用逐令牌预测的序列特性会导致显著延迟,在推理速度关键的应用场景中难以部署。本研究提出语音推测解码(SSD)框架,一种创新的自回归语音合成加速方法。具体而言,该方法采用轻量级草稿模型生成候选令牌序列,随后通过目标模型在SSD框架下进行并行验证。实验结果表明,与传统自回归解码相比,SSD在保持高保真度和自然度的同时实现了1.4倍的显著加速。主观评估进一步证实,SSD在加速推理的同时能有效保持目标模型的感知质量。
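
The draft-then-verify loop described above can be sketched as follows. This is a minimal greedy variant under stated assumptions, not the SSD implementation: the abstract does not specify the acceptance rule, so exact-match acceptance is assumed here, and `toy_target`, `toy_draft`, and the block size `k` are hypothetical stand-ins. A real system verifies all drafted positions in one batched forward pass of the target model; the sketch checks them in a loop for clarity.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token prefix to its next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding (assumed exact-match acceptance rule)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The lightweight draft model proposes up to k candidate tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target model checks each candidate position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out


# Hypothetical toy models: the draft agrees with the target except after token 3.
def toy_target(prefix: Sequence[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[int]) -> int:
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10
```

Because every emitted token is either confirmed or supplied by the target, this variant reproduces the target's greedy output exactly; any speedup comes from how many drafted tokens are accepted per verification round.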


Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Abstract

arXiv:2505.15311v1 Announce Type: cross Abstract: Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as Q-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

摘要

当前基于策略的方法主导着大型语言模型(LLM)推理的强化学习(RL)流程,而基于价值的方法则鲜少被探索。我们重新审视贝尔曼残差最小化这一经典范式,提出了轨迹贝尔曼残差最小化(TBRM)算法,该算法将这一思想自然适配至LLM,产生了一种简单而有效的离策略算法。该算法通过使用模型自身logits作为Q值,优化单一轨迹级贝尔曼目标。TBRM无需批评器、重要性采样比率或截断操作,且每个提示仅需一次rollout。通过改进的轨迹测度变换分析,我们证明了该方法能从任意离策略数据收敛至接近最优的KL正则化策略。在标准数学推理基准测试上的实验表明,TBRM在计算和内存开销相当或更低的情况下,持续优于PPO、GRPO等基于策略的基线方法。我们的结果表明,基于价值的RL可能是提升LLM推理能力的一种原则性且高效的替代方案。
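
To make the idea of a trajectory-level Bellman objective over logits concrete, here is a small sketch. It is not the TBRM objective from the paper, whose exact form the abstract does not give. The assumptions are: transitions are deterministic (appending a token), the value of a state is the soft maximum `beta * logsumexp(logits / beta)`, the reference-policy term of the KL regularizer is omitted, and the value after the last step is zero. The names `soft_value` and `trajectory_bellman_residual` are hypothetical.

```python
import math
from typing import List, Sequence


def soft_value(logits: Sequence[float], beta: float) -> float:
    """Soft (entropy-regularized) state value: beta * log(sum(exp(q / beta)))."""
    m = max(logits)  # subtract the max for numerical stability
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in logits))


def trajectory_bellman_residual(step_logits: List[Sequence[float]],
                                actions: List[int],
                                rewards: List[float],
                                beta: float = 1.0) -> float:
    """Squared sum of per-step soft Bellman residuals along one trajectory.

    step_logits[t] are the model's logits at step t, read as Q(s_t, .).
    Assumption: the value after the final step is 0 (episode ends).
    """
    total = 0.0
    horizon = len(actions)
    for t in range(horizon):
        q_sa = step_logits[t][actions[t]]
        v_next = soft_value(step_logits[t + 1], beta) if t + 1 < horizon else 0.0
        total += q_sa - rewards[t] - v_next
    return total ** 2
```

Summing the residuals before squaring yields a single scalar per rollout, which matches the abstract's description of one trajectory-level objective computed from one rollout per prompt; a training loop would minimize this quantity with respect to the model parameters that produce the logits.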


LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models

Abstract

arXiv:2505.15293v1 Announce Type: cross Abstract: Policy exploration is critical in reinforcement learning (RL), where existing approaches include greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied in all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, which typically only incorporates a decay in the variance, failing to adjust flexibly according to the agent's real-time learning status. Inspired by the analyzing and reasoning capability of large language models (LLMs), we design LLM-Explorer to adaptively generate task-specific exploration strategies with LLMs, enhancing the policy exploration in RL. In our design, we sample the learning trajectory of the agent during the RL training in a given task and prompt the LLM to analyze the agent's current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer's capability to enhance RL policy exploration, achieving an average performance improvement up to 37.27%. Our code is open-source at https://anonymous.4open.science/r/LLM-Explorer-19BE for reproducibility.

摘要

策略探索在强化学习(RL)中至关重要,现有方法包括贪婪策略、高斯过程等。然而,这些方法采用预设的随机过程,且不加区分地应用于各类RL任务,未考虑影响策略探索的任务特定特征。此外,在RL训练过程中,此类随机过程的演化机制僵化,通常仅包含方差衰减,无法根据智能体的实时学习状态灵活调整。受大型语言模型(LLMs)分析推理能力的启发,我们设计LLM-Explorer,利用LLMs自适应生成任务专属的探索策略以增强RL策略探索。该设计通过采样智能体在给定任务中的学习轨迹,提示LLM分析当前策略学习状态,进而生成用于未来策略探索的概率分布。通过周期性更新该概率分布,我们构建出专用于特定任务、并能动态适应学习过程的随机过程。本设计为插件式模块,兼容DQN系列、DDPG、TD3等广泛应用的RL算法及其衍生变体。基于Atari和MuJoCo基准的广泛实验表明,LLM-Explorer可提升RL策略探索能力,平均性能最高提升达37.27%。代码已开源于https://anonymous.4open.science/r/LLM-Explorer-19BE以确保可复现性。


Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Abstract

arXiv:2505.15337v1 Announce Type: cross Abstract: The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.

摘要

大型语言模型(LLMs)的滥用(如学术抄袭)推动了检测器的发展以识别LLM生成的文本。为规避这些检测器,改述攻击应运而生,其通过刻意重写文本来逃避检测。尽管现有方法取得了一定成效,但它们需要大量数据和计算资源来训练专用改述器,且在面对先进检测算法时攻击效果大幅下降。为此,我们提出对比式改述攻击(CoPA),这是一种无需训练的方法,可利用现成LLMs有效欺骗文本检测器。首先需精心设计指令,促使LLMs生成更接近人类书写的文本。然而我们观察到,LLMs固有的统计偏差仍会导致部分生成文本携带可被检测器捕获的机器特征。为解决此问题,CoPA构建了一个辅助的机器化词分布,与LLM生成的人类化分布形成对比。通过在解码过程中从人类化分布中减去机器化模式,CoPA能生成不易被文本检测器识别的句子。理论分析表明该攻击方法具有优越性。大量实验验证了CoPA在不同场景下欺骗文本检测器的有效性。
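
The subtraction of machine-like patterns during decoding can be illustrated with a generic contrastive-decoding step. This is a sketch under assumptions rather than CoPA's exact formulation, which the abstract does not spell out: the adjusted score is taken to be `log p_human - lam * log p_machine`, renormalized with a softmax, and the weight `lam`, the smoothing constant `eps`, and the example vocabulary are all hypothetical.

```python
import math
from typing import Dict


def contrastive_distribution(p_human: Dict[str, float],
                             p_machine: Dict[str, float],
                             lam: float = 0.5,
                             eps: float = 1e-9) -> Dict[str, float]:
    """Down-weight tokens that the machine-like distribution favors.

    p_human:   next-token probabilities under a human-style prompt.
    p_machine: next-token probabilities under a machine-style prompt.
    lam:       contrast strength (assumed hyperparameter); 0 disables contrast.
    """
    scores = {
        tok: math.log(p + eps) - lam * math.log(p_machine.get(tok, 0.0) + eps)
        for tok, p in p_human.items()
    }
    top = max(scores.values())  # stabilize the softmax
    weights = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Hypothetical next-token distributions for one decoding step.
p_human = {"delve": 0.5, "dig": 0.3, "look": 0.2}
p_machine = {"delve": 0.8, "dig": 0.1, "look": 0.1}
adjusted = contrastive_distribution(p_human, p_machine, lam=0.5)
```

In this toy step the token that both distributions favor ("delve") loses probability mass, while a token the machine-like distribution rarely picks ("dig") becomes the most likely choice.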


Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One

Abstract

arXiv:2505.15306v1 Announce Type: cross Abstract: Model ensemble is a useful approach in reinforcement learning (RL) for training effective agents. Despite wide success of RL, training effective agents remains difficult due to the multitude of factors requiring careful tuning, such as algorithm selection, hyperparameter settings, and even random seed choices, all of which can significantly influence an agent's performance. Model ensemble helps overcome this challenge by combining multiple weak agents into a single, more powerful one, enhancing overall performance. However, existing ensemble methods, such as majority voting and Boltzmann addition, are designed as fixed strategies and lack a semantic understanding of specific tasks, limiting their adaptability and effectiveness. To address this, we propose LLM-Ens, a novel approach that enhances RL model ensemble with task-specific semantic understandings driven by large language models (LLMs). Given a task, we first design an LLM to categorize states in this task into distinct 'situations', incorporating high-level descriptions of the task conditions. Then, we statistically analyze the strengths and weaknesses of each individual agent to be used in the ensemble in each situation. During the inference time, LLM-Ens dynamically identifies the changing task situation and switches to the agent that performs best in the current situation, ensuring dynamic model selection in the evolving task condition. Our approach is designed to be compatible with agents trained with different random seeds, hyperparameter settings, and various RL algorithms. Extensive experiments on the Atari benchmark show that LLM-Ens significantly improves the RL model ensemble, surpassing well-known baselines by up to 20.9%. For reproducibility, our code is open-source at https://anonymous.4open.science/r/LLM4RLensemble-F7EE.

摘要

模型集成是强化学习(RL)中训练高效智能体的有效方法。尽管强化学习已取得广泛成功,但由于算法选择、超参数设置甚至随机种子选择等多重因素需精细调校,训练高效智能体仍具挑战性,这些因素均可能显著影响智能体性能。模型集成通过将多个弱智能体组合成单一更强智能体来克服这一挑战,从而提升整体性能。然而现有集成方法(如多数投票和玻尔兹曼加法)采用固定策略,缺乏对特定任务的语义理解,限制了其适应性与有效性。为此,我们提出LLM-Ens——一种通过大型语言模型(LLMs)驱动的任务语义理解来增强RL模型集成的新方法。给定任务时,我们首先设计LLM将该任务中的状态划分为不同"情境",并整合任务条件的高层描述;随后统计分析集成中每个智能体在各情境下的优劣势。在推理阶段,LLM-Ens动态识别任务情境变化并切换至当前情境下表现最优的智能体,确保在动态任务条件下实现模型选择的适应性。该方法兼容不同随机种子、超参数设置及各类RL算法训练的智能体。Atari基准测试表明,LLM-Ens显著提升RL模型集成效果,以最高20.9%的优势超越知名基线方法。为保障可复现性,代码已开源:https://anonymous.4open.science/r/LLM4RLensemble-F7EE。
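
The two stages described above, per-situation statistics followed by run-time switching, can be sketched as below. This is only an illustration: in LLM-Ens the situation label comes from an LLM, whereas here `classify` is a hypothetical callable supplied by the caller, and mean reward is an assumed stand-in for whatever statistic the authors use to compare agents.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


def build_situation_table(records: List[Tuple[str, str, float]]) -> Dict[str, str]:
    """Map each situation to the agent with the highest mean reward in it.

    records: (situation, agent_name, reward) triples from evaluation rollouts.
    """
    totals: Dict[Tuple[str, str], List[float]] = defaultdict(lambda: [0.0, 0.0])
    for situation, agent, reward in records:
        entry = totals[(situation, agent)]
        entry[0] += reward  # running sum of rewards
        entry[1] += 1       # number of samples
    best: Dict[str, Tuple[str, float]] = {}
    for (situation, agent), (total, count) in totals.items():
        mean = total / count
        if situation not in best or mean > best[situation][1]:
            best[situation] = (agent, mean)
    return {situation: agent for situation, (agent, _) in best.items()}


def ensemble_act(state: Any,
                 classify: Callable[[Any], str],
                 table: Dict[str, str],
                 agents: Dict[str, Callable[[Any], Any]],
                 default_agent: str) -> Any:
    """Pick the action of the agent that performed best in the current situation."""
    situation = classify(state)                  # hypothetical: an LLM call in LLM-Ens
    name = table.get(situation, default_agent)   # fall back for unseen situations
    return agents[name](state)
```

Unlike majority voting, this scheme never mixes actions: at each step exactly one agent acts, chosen by the situation the state falls into.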


RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

Abstract

arXiv:2505.15386v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.

摘要

大型语言模型(LLMs)已展现出强大能力,但幻觉问题仍是影响其可信应用的关键障碍。尽管先前研究通过不确定性度量提升了幻觉检测能力,但这些方法均无法解释幻觉产生的根源,即输入中哪些部分容易引发幻觉。近期关于提示攻击的研究表明,语义传播过程中存在不确定性——注意力机制在各层中逐步将局部词元信息融合为高层语义;同时,语言生成过程因其基于概率的高层语义采样选择也会产生不确定性。基于此,我们提出RePPL方法,通过这两方面重新校准不确定性度量:为每个词元分配可解释的不确定性分数,并以困惑度式对数平均形式聚合为总分。实验表明,本方法在先进模型的各种QA数据集上实现了最佳综合检测性能(平均AUC达0.833),并能生成词元级不确定性分数作为幻觉解释。利用这些分数,我们初步发现了幻觉的混沌模式,并展示了其潜在应用价值。
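
The "perplexity-style log-average" aggregation can be read as a geometric mean of per-token scores. The sketch below assumes that reading; how RePPL actually computes each token's uncertainty score is not given in the abstract, so the inputs here are arbitrary positive numbers and the function names are hypothetical.

```python
import math
from typing import Sequence


def log_average(scores: Sequence[float]) -> float:
    """exp(mean(log s_i)): the geometric mean of positive per-token scores."""
    if not scores:
        raise ValueError("need at least one token score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def perplexity(token_probs: Sequence[float]) -> float:
    """Standard perplexity, recovered by using 1 / p_i as the per-token score."""
    return log_average([1.0 / p for p in token_probs])
```

Keeping the per-token scores around before aggregation is what allows a token-level explanation: the total is a single number, but each term in the average points at one token.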


Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization

Abstract

arXiv:2505.15444v1 Announce Type: cross Abstract: Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.

摘要

现有研究已在查询理解和检索优化等多个子任务上对检索增强生成(RAG)进行了优化,但将这些优化整合到统一框架中仍具挑战性。为解决这一问题,本研究提出RoleRAG——一个通过角色特定令牌优化实现高效多任务处理的统一RAG框架。该框架包含六个模块,分别处理RAG流程中的特定子任务。此外,我们引入查询图来表示查询的分解状态,该图可根据分解状态动态解析。所有模块由同一底层大语言模型驱动,通过单独优化的任务特定角色令牌进行区分。这种设计使得RoleRAG能在单一LLM实例中动态激活不同模块,从而简化部署并降低资源消耗。在五个开放域问答数据集上的实验结果验证了本框架的有效性、泛化性和灵活性。


Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries

Abstract

arXiv:2505.15420v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, but they are vulnerable to privacy risks from data extraction attacks. Existing extraction methods typically rely on malicious inputs such as prompt injection or jailbreaking, making them easily detectable via input- or output-level detection. In this paper, we introduce Implicit Knowledge Extraction Attack (IKEA), which conducts knowledge extraction on RAG systems through benign queries. IKEA first leverages anchor concepts to generate queries with a natural appearance, and then designs two mechanisms that lead the anchor concepts to thoroughly 'explore' the RAG's private knowledge: (1) Experience Reflection Sampling, which samples anchor concepts based on past query-response patterns to ensure the queries' relevance to RAG documents; (2) Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints to further exploit the embedding space. Extensive experiments demonstrate IKEA's effectiveness under various defenses, surpassing baselines by over 80% in extraction efficiency and 90% in attack success rate. Moreover, the substitute RAG system built from IKEA's extractions consistently outperforms those based on baseline methods across multiple evaluation tasks, underscoring the significant privacy risk in RAG systems.

摘要

检索增强生成(RAG)系统通过整合外部知识库来增强大语言模型(LLM)的性能,但这类系统易受数据提取攻击引发的隐私风险威胁。现有提取方法通常依赖提示注入或越狱等恶意输入,使得其可通过输入或输出层面的检测机制轻易识别。本文提出隐式知识提取攻击(IKEA),该方法通过良性查询实现对RAG系统的知识提取。IKEA首先利用锚点概念生成具有自然语义特征的查询语句,随后设计两种机制引导锚点概念对RAG隐私知识进行彻底"探索":(1)经验反射采样,基于历史查询-响应模式筛选锚点概念,确保查询与RAG文档的相关性;(2)信任域定向变异,在相似性约束下迭代变异锚点概念以深度挖掘嵌入空间。大量实验表明,IKEA在多种防御机制下均表现优异,其提取效率较基线方法提升超80%,攻击成功率提高逾90%。此外,基于IKEA提取结果构建的替代RAG系统在多项评估任务中持续优于基线方法构建的系统,这凸显了RAG系统存在的重大隐私风险。


Set-LLM: A Permutation-Invariant LLM

Abstract

arXiv:2505.15433v1 Announce Type: cross Abstract: While large language models (LLMs) demonstrate impressive capabilities across numerous applications, their robustness remains a critical concern. This paper is motivated by a specific vulnerability: the order sensitivity of LLMs. This vulnerability manifests itself as the order bias observed when LLMs decide between possible options (for example, a preference for the first option) and the tendency of LLMs to provide different answers when options are reordered. The use cases for this scenario extend beyond the classical case of multiple-choice question answering to the use of LLMs as automated evaluators in AI pipelines, comparing output generated by different models. We introduce Set-LLM, a novel architectural adaptation for pretrained LLMs that enables the processing of mixed set-text inputs with permutation invariance guarantees. The adaptations involve a new attention mask and new positional encodings specifically designed for sets. We provide a theoretical proof of invariance and demonstrate through experiments that Set-LLM can be trained effectively, achieving comparable or improved performance and maintaining the runtime of the original model, while eliminating order sensitivity.

摘要

尽管大语言模型(LLMs)在众多应用中展现出卓越的能力,但其鲁棒性仍是关键问题。本文针对一个特定漏洞展开研究:LLMs的顺序敏感性。该漏洞表现为两种现象:LLMs在多个选项间进行决策时出现的顺序偏差(例如对首个选项的偏好),以及选项重新排序时模型给出不同答案的倾向。这种场景的应用范围超越传统的多选题回答场景,延伸至LLMs作为AI流程中的自动评估器、比较不同模型输出的场景。我们提出Set-LLM,一种针对预训练LLMs的新型架构适配方案,可处理具有排列不变性保证的混合集合-文本输入。该方案包含专门为集合设计的新型注意力掩码和位置编码。我们提供了不变性的理论证明,并通过实验验证Set-LLM能有效训练,在保持原始模型运行效率的同时获得相当或更优的性能,且完全消除了顺序敏感性。


Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning

Abstract

arXiv:2505.15467v1 Announce Type: cross Abstract: Large language models have achieved remarkable success in various tasks. However, it is challenging for them to learn new tasks incrementally due to catastrophic forgetting. Existing approaches rely on experience replay, optimization constraints, or task differentiation, which encounter strict limitations in real-world scenarios. To address these issues, we propose Joint Flashback Adaptation. We first introduce flashbacks -- a limited number of prompts from old tasks -- when adapting to new tasks and constrain the deviations of the model outputs compared to the original one. We then interpolate latent tasks between flashbacks and new tasks to enable jointly learning relevant latent tasks, new tasks, and flashbacks, alleviating data sparsity in flashbacks and facilitating knowledge sharing for smooth adaptation. Our method requires only a limited number of flashbacks without access to the replay data and is task-agnostic. We conduct extensive experiments on state-of-the-art large language models across 1000+ instruction-following tasks, arithmetic reasoning tasks, and general reasoning tasks. The results demonstrate the superior performance of our method in improving generalization on new tasks and reducing forgetting in old tasks.

摘要

大型语言模型在各种任务中取得了显著成功。然而由于灾难性遗忘问题,它们难以实现新任务的增量学习。现有方法依赖于经验回放、优化约束或任务区分策略,这些方法在现实场景中存在严格局限。为解决这些问题,我们提出联合回溯适应方法。我们首先在适应新任务时引入回溯机制——即从旧任务中提取有限数量的提示样本——并约束模型输出相对于原始版本的偏离程度。随后通过在回溯样本与新任务之间插值潜在任务,实现相关潜在任务、新任务与回溯样本的联合学习,从而缓解回溯样本的数据稀疏性问题,促进知识共享以实现平滑适应。本方法仅需少量回溯样本且无需访问回放数据,同时具有任务无关性。我们在最先进的大型语言模型上进行了广泛实验,涵盖1000+指令跟随任务、算术推理任务和通用推理任务。结果表明,该方法在提升新任务泛化能力和减少旧任务遗忘方面具有卓越性能。


Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

Abstract

arXiv:2505.15406v1 Announce Type: cross Abstract: The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text to speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.

摘要

大型音频语言模型(LAMs)的兴起既带来潜力也伴随风险,其音频输出可能包含有害或不道德内容。然而当前研究缺乏对LAM安全性的系统化定量评估,尤其在对抗越狱攻击方面——由于语音的时序性和语义特性,这类攻击具有特殊挑战性。为填补这一空白,我们提出首个专门评估LAM越狱漏洞的基准AJailBench。首先构建包含1,495个对抗性音频提示的AJailBench-Base数据集,涵盖10类违规场景,这些数据通过真实文本转语音技术从文本越狱攻击转换而来。基于该数据集对多个前沿LAM进行评估,发现所有模型均未表现出跨攻击的持续鲁棒性。为增强越狱测试并模拟更真实的攻击条件,我们提出动态对抗变体生成方法:通过自主研发的音频扰动工具包(APT),在时域、频域和幅值域实施针对性失真处理。为保持原始越狱意图,采用语义一致性约束条件,并利用贝叶斯优化高效搜索兼具隐蔽性与高效性的扰动方案,最终形成扩展数据集AJailBench-APT。实验表明,即使微小但语义保持的扰动也能显著降低主流LAMs的安全性能,这凸显了开发更具鲁棒性且语义感知防御机制的必要性。


Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs

Abstract

arXiv:2505.15501v1 Announce Type: cross Abstract: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.

摘要

我们提出"原知识"概念,用以形式化衡量大规模语言模型(LLMs)在预训练过程中如何内化编码知识图谱的token序列,并在推理时加以利用。事实上,LLMs已展现出在预训练阶段记忆海量token序列的能力,而核心问题在于它们如何通过泛化将这种记忆转化为可复用的知识。我们将原知识分为词汇型、层级型和拓扑型三类,其差异在于所需激活的知识类型。通过知识激活任务(KATs)测量原知识,分析其语义偏差等基本特性。随后,我们通过根据输入条件调整提示策略,研究原知识对文本到SPARQL转换性能的影响。为此,我们采用新型分析框架来评估模型预测是否与每个查询相关原知识的成功激活相一致。该方法为探索语义级数据污染提供了实用工具,同时构成封闭预训练模型的有效策略。


ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning

Abstract

arXiv:2505.15447v1 Announce Type: cross Abstract: Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15% improvement on Needle QA, a subset of MLVU that requires searching for a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.

摘要

视频理解本质上是意图驱动的——人类会自然地根据目标聚焦相关帧。尽管多模态大语言模型(MLLMs)的最新进展已实现灵活的查询驱动推理,但诸如视频思维链(Video Chain-of-Thought)等框架缺乏直接训练信号来有效识别相关帧。现有方法通常依赖启发式方法或伪标签监督标注,这些方法不仅成本高昂,且在多场景下的可扩展性有限。为克服这些挑战,我们提出ViaRL——首个基于规则强化学习(RL)优化意图驱动视频理解中帧选择的框架。该框架采用迭代放大策略,在视频思维链系统中进行交替循环训练,各组件通过迭代优化周期持续提升能力。ViaRL利用下游模型的答案准确度作为奖励信号,通过试错机制训练帧选择器,既无需昂贵标注,又高度契合人类学习机制。在VideoMME、LVBench和MLVU等多个基准测试中的综合实验表明,ViaRL在不同视频理解任务中始终提供卓越的时间定位性能和鲁棒泛化能力,凸显其有效性与可扩展性。值得注意的是,在MLVU子集Needle QA(需在长视频中定位特定片段,被视为评估时间定位最合适的基准之一)上,ViaRL实现了近15%的性能提升。


A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Abstract

arXiv:2505.15469v1 Announce Type: cross Abstract: Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2 across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. They highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.

摘要

大型语言模型已成为重要的编程辅助工具,但其训练过程主要基于英语语境。本研究评估了代码语言模型在非英语环境下的表现,揭示了其在多语言工作流程中应用与整合的挑战。我们通过开放式编码研究,分析了五种前沿代码模型(CodeGemma、CodeLlama、CodeQwen1.5、GraniteCode和StarCoder2)生成的代码注释在五种自然语言(中文、荷兰语、英语、希腊语和波兰语)中的错误模式,并公开发布了包含12,500条标注生成结果的数据集。随后,我们评估了标准指标在跨语言注释"正确性"衡量中的可靠性,检验了其作为评判标准的可信度。通过开放式编码调查,我们归纳出模型生成代码注释中26类典型错误,这些错误凸显了不同自然语言在语言连贯性、信息量和语法遵循方面的差异。分析表明,尽管这些模型常生成部分正确的注释,但现代神经指标无法有效区分有意义的生成结果与随机噪声。值得注意的是,专家评定的正确与错误注释之间存在显著分数重叠,这对现有指标评估生成注释的有效性提出了质疑。


LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

Abstract

arXiv:2505.15475v1 Announce Type: cross Abstract: Nowadays, Large Language Models (LLMs) have attracted widespread attention due to their powerful performance. However, due to the unavoidable exposure to socially biased data during training, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets named GenBiasEval and GenHintEval. GenBiasEval evaluates the degree of gender bias in LLMs and is accompanied by an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). GenHintEval assesses whether LLMs can provide responses consistent with prompts that contain gender hints and is accompanied by the evaluation metric UB-Score (UnBias Score). In addition, to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. The algorithm first ranks specific LLM blocks by their relevance to gender bias in descending order using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned using a carefully designed loss function. Numerous experiments have shown that our proposed LFTF algorithm can significantly mitigate gender bias in LLMs while maintaining their general capabilities.

摘要

当前,大语言模型(LLM)因其强大性能受到广泛关注。然而由于训练过程中不可避免地接触社会偏见数据,LLM往往表现出社会偏见,尤其是性别偏见。为深入探究并量化LLM的性别偏见程度,我们分别提出名为GenBiasEval和GenHintEval的配对数据集。其中GenBiasEval负责评估LLM的性别偏见程度,并配套AFGB-Score(绝对公平性别偏见分数)评估指标;GenHintEval则用于检测LLM能否对含性别提示的指令作出一致性响应,配套UB-Score(无偏见分数)评估指标。此外,为更有效缓解LLM的性别偏见,我们提出LFTF(定位优先再微调)算法。该算法首先通过BMI(区块缓解重要性分数)指标按与性别偏见的关联度降序排列特定LLM区块,继而基于排序结果,采用精心设计的损失函数对性别偏见关联最强的区块进行微调。大量实验表明,我们提出的LFTF算法能在保持LLM通用能力的同时显著降低其性别偏见。


DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion

Abstract

arXiv:2505.15554v1 Announce Type: cross Abstract: Critical questions are essential resources to provoke critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach leverages large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton's argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template to first obtain structured arguments, and then generate relevant critical questions. Following this, we rank all the available critical questions by prompting LLMs to select the top 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance in the final test set, showing its potential to foster critical thinking given argumentative text and detect missing or uninformed claims. Code available at https://git.ecdf.ed.ac.uk/s2236454/DayDreamer-CQs-Gen (DayDreamer).

摘要

批判性问题是在遇到论证性文本时激发批判性思维的重要资源。本文介绍了我们为ArgMining 2025"批判性问题生成(CQs-Gen)共享任务"开发的系统。我们的方法利用大语言模型(LLMs)结合思维链提示技术,基于Walton的论证方案生成批判性问题。对于每个输入干预,我们通过对话式提示让LLMs实例化相应的论证方案模板,首先生成结构化论证,进而产生相关批判性问题。随后,我们通过提示LLMs根据原始干预文本筛选出最有帮助的3个问题,对所有可用批判性问题进行排序。这种结构化论证理论与逐步推理相结合的方法,能够生成语境相关且多样化的批判性问题。我们的流程在最终测试集中表现出竞争力,展现了其在促进论证文本批判性思维、检测缺失或无依据主张方面的潜力。代码发布于 https://git.ecdf.ed.ac.uk/s2236454/DayDreamer-CQs-Gen (DayDreamer)。


Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Abstract

arXiv:2505.15524v1 Announce Type: cross Abstract: Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.

摘要

大型语言模型(LLMs)中的偏见严重损害了其可靠性与公平性。我们关注一种常见偏见形式:当模型概念空间中两个参照概念(如情感极性"积极"与"消极")与第三个目标概念(如评论方面)存在非对称关联时,模型会表现出非预期偏见。例如,对"食物"的理解不应偏向任何特定情感。现有偏见评估方法通过为不同社会群体构建标注数据并测量模型响应差异进行评估,这一过程需耗费大量人力且仅能捕捉有限社会概念。为突破这些限制,我们提出BiasLens——基于模型向量空间结构的无测试集偏见分析框架。该框架将概念激活向量(CAVs)与稀疏自编码器(SAEs)相结合以提取可解释的概念表征,并通过量化目标概念与各参照概念之间表征相似性的变异程度来测量偏见。即使在没有标注数据的情况下,BiasLens与传统偏见评估指标仍保持高度一致性(斯皮尔曼相关系数r > 0.85)。此外,BiasLens能揭示现有方法难以检测的偏见形式,例如在模拟临床场景中,患者的保险状态会导致LLM产生带有偏见的诊断评估。总体而言,BiasLens为偏见发现提供了可扩展、可解释且高效的范式,为提升LLMs的公平性与透明度开辟了新路径。
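
The similarity-gap idea in the abstract above can be illustrated with a minimal sketch. It assumes concept vectors have already been extracted (the abstract obtains them with CAVs and SAEs, which is not reproduced here). The function name `bias_score`, the absolute-difference form, and the toy 2-D vectors are hypothetical illustrations, not the paper's exact metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bias_score(target, ref_a, ref_b):
    """Gap between the target's similarity to each reference concept.

    Assumption: bias is summarized as an absolute difference of cosine
    similarities; a score near 0 means the target is equally close to both.
    """
    return abs(cosine(target, ref_a) - cosine(target, ref_b))


# Toy concept vectors (hypothetical, for illustration only).
positive = [1.0, 0.0]
negative = [0.0, 1.0]
food_neutral = [1.0, 1.0]   # equally aligned with both sentiments
food_skewed = [1.0, 0.0]    # aligned only with "positive"

neutral_gap = bias_score(food_neutral, positive, negative)
skewed_gap = bias_score(food_skewed, positive, negative)
```

In this toy setup the neutral vector yields a gap of zero while the skewed one yields the maximum gap, matching the abstract's intuition that "food" should not lean toward either sentiment.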


Abstract

arXiv:2505.15553v1 Announce Type: cross Abstract: Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the capabilities of large language models (LLMs) in retrieving and reproducing knowledge. However, we demonstrate that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way, potentially due to a lack of diversity of those involved in their creation. We perform a qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) how social bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most analyzed benchmark papers provided insufficient information regarding the stakeholders involved in benchmark creation, particularly the annotators. Notably, just one of the benchmark papers explicitly reported measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. More transparent and bias-aware QA and RC benchmark creation practices are needed to facilitate better scrutiny and incentivize the development of fairer LLMs.

摘要

问答(QA)和阅读理解(RC)基准测试对于评估大语言模型(LLM)在知识检索与复现方面的能力至关重要。然而,我们发现流行的QA和RC基准测试存在偏见,且未能以代表性方式涵盖不同人口统计特征或地区的问题,这可能是由于创建者缺乏多样性所致。通过对30篇基准测试论文的定性内容分析及20个对应基准数据集的定量分析,我们探究了以下问题:(1)谁参与了基准测试的创建;(2)如何解决或预防社会偏见;(3)创建者与标注者的人口统计特征是否与内容中的特定偏见相关。大多数被分析的基准测试论文未充分说明参与创建过程的利益相关者(尤其是标注者)信息。值得注意的是,仅有一篇论文明确报告了为解决社会代表性议题采取的措施。此外,数据分析揭示了百科全书类、常识类和学术类基准测试中普遍存在的性别、宗教与地域偏见。需要建立更透明且具偏见意识的QA和RC基准测试创建规范,以促进更严格的审查机制,并推动开发更公平的大语言模型。


From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning

Abstract

arXiv:2505.15607v1 Announce Type: cross Abstract: Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.

摘要

大型语言模型(LLMs)能够变革教育领域,但其针对直接问答的优化往往会削弱需要策略性保留答案的有效教学法。为此,我们提出一种基于在线强化学习(RL)的对齐框架,该框架通过模拟师生互动,强调教学质量和引导式问题解决而非直接提供答案,从而快速将LLMs适配为高效辅导工具。我们采用该方法训练了一个无需人工标注的70亿参数辅导模型,其性能可媲美LearnLM等大型专有模型。我们引入可控奖励加权机制以平衡教学支持与学生解题准确率,从而追踪这两个目标之间的帕累托前沿。相较于单轮监督微调基线模型,我们的模型能更好地保持推理能力,并可通过暴露教学规划过程的思维标签来选择性增强可解释性。
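
The controllable reward weighting mentioned above can be pictured with a small sketch. The abstract does not give the formula, so the convex combination below, the parameter name `lam`, and the score ranges are assumptions made purely for illustration.

```python
def combined_reward(solve_accuracy, pedagogy_score, lam):
    """Blend two objectives with a single weight.

    Assumptions (not from the paper): both scores lie in [0, 1], and the
    blend is a convex combination controlled by lam in [0, 1].
    lam = 0 rewards only student solving accuracy;
    lam = 1 rewards only pedagogical quality.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return (1.0 - lam) * solve_accuracy + lam * pedagogy_score


# Sweeping lam gives one reward setting per point on the trade-off curve.
sweep = [combined_reward(0.5, 1.0, lam) for lam in (0.0, 0.5, 1.0)]
```

Training one policy per value of `lam` and plotting the two resulting scores is one way such a sweep could trace the trade-off between the objectives.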


Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions

Abstract

arXiv:2505.15633v1 Announce Type: cross Abstract: Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.

摘要

使用检索增强生成的大型语言模型有望通过提高气候相关技术文档的可读性,为研究人员、政策制定者和公众释放宝贵的知识价值。尽管这种方法通过依赖检索段落作为附加语境有助于缓解事实性幻觉问题,但其有效性取决于模型输出是否忠实于这些段落。为此,我们探索了该场景下不同模型忠实度的自动化评估方法。随后,我们聚焦气候科学专用大语言模型ClimateGPT,研究其指令微调中影响模型忠实度的关键因素。通过排除训练数据中不忠实的子集,我们开发出ClimateGPT Faithful+版本,根据自动化指标显示,该版本在支持性原子主张方面的忠实度从30%提升至57%。
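
The "supported atomic claims" figure quoted above is a ratio, which the sketch below makes concrete. It assumes an upstream step has already split an answer into atomic claims and labeled each one as supported or not by the retrieved passages; that step, and the function name `faithfulness`, are hypothetical here.

```python
def faithfulness(claim_supported):
    """Fraction of atomic claims supported by the retrieved passages.

    claim_supported: list of booleans, one per atomic claim
    (assumed to come from a separate claim-extraction and checking step).
    """
    if not claim_supported:
        raise ValueError("need at least one atomic claim")
    return sum(1 for ok in claim_supported if ok) / len(claim_supported)


# Example: 3 of 4 claims are backed by the context.
score = faithfulness([True, False, True, True])
```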


Exploring LLM-Generated Feedback for Economics Essays: How Teaching Assistants Evaluate and Envision Its Use

Abstract

arXiv:2505.15596v1 Announce Type: cross Abstract: This project examines the prospect of using AI-generated feedback as suggestions to expedite and enhance human instructors' feedback provision. In particular, we focus on understanding the teaching assistants' perspectives on the quality of AI-generated feedback and how they may or may not utilize AI feedback in their own workflows. We situate our work in a foundational college Economics class, which has frequent short essay assignments. We developed an LLM-powered feedback engine that generates feedback on students' essays based on grading rubrics used by the teaching assistants (TAs). To ensure that TAs can meaningfully critique and engage with the AI feedback, we had them complete their regular grading jobs. For a randomly selected set of essays that they had graded, we used our feedback engine to generate feedback and displayed the feedback as in-text comments in a Word document. We then performed think-aloud studies with 5 TAs over 20 one-hour sessions to have them evaluate the AI feedback, contrast the AI feedback with their handwritten feedback, and share how they envision using the AI feedback if it were offered as suggestions. The study highlights the importance of providing detailed rubrics for AI to generate high-quality feedback for knowledge-intensive essays. TAs considered that using AI feedback as suggestions during their grading could expedite grading, enhance consistency, and improve overall feedback quality. We discuss the importance of decomposing the feedback generation task into steps and presenting intermediate results, in order for TAs to use the AI feedback.

摘要

本项目探讨了利用人工智能生成反馈作为建议,以加速和提升人类教师反馈提供的可能性。我们特别关注助教对AI生成反馈质量的看法,以及他们如何在其工作流程中利用或不用AI反馈。研究基于一门基础大学经济学课程展开,该课程设有频繁的短文作业。我们开发了一个基于大型语言模型的反馈引擎,该引擎根据助教使用的评分标准生成学生论文反馈。为确保助教能够有意义地评价和参与AI反馈,我们要求他们完成常规评分工作。针对随机选取的已评分论文,我们使用反馈引擎生成反馈,并以Word文档内批注形式呈现。随后,我们与5位助教进行了20次1小时的出声思考研究,让他们评估AI反馈、对比AI反馈与其手写反馈,并分享若AI反馈作为建议提供时的使用设想。研究强调,为AI提供详细评分标准对生成知识密集型论文的高质量反馈至关重要。助教认为在评分过程中将AI反馈作为建议使用,可加速评分、增强一致性并提升整体反馈质量。我们讨论了将反馈生成任务分解为步骤并展示中间结果的重要性,以便助教有效利用AI反馈。


DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Abstract

arXiv:2505.15734v1 Announce Type: cross Abstract: Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.

摘要

大型语言模型(LLMs)通过海量数据集的广泛训练,其推理能力已显著提升。然而,单纯依赖额外数据进行改进的做法正变得日益不可行,这凸显了模型需在无外部监督条件下自主增强推理能力的必要性。本文提出"辩论-训练-进化"(DTE)框架,这是一种无需真实标注的新型训练范式,通过多智能体辩论轨迹来进化单一语言模型。我们同时引入"反思-批判-优化"提示策略,通过显式指导智能体批判与精炼其推理过程来提升辩论质量。在五个推理基准测试和六个开源模型上的大量实验表明,DTE框架实现了显著改进,在极具挑战性的GSM-PLUS数据集上平均准确率提升达8.92%。此外,我们观察到强大的跨领域泛化能力,所有其他基准测试平均准确率提升5.8%,这表明我们的方法能够捕捉通用推理能力。


Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Abstract

arXiv:2505.15612v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.

摘要

大型推理模型(LRMs)通过强化学习(RL)在解决复杂问题方面展现出卓越能力,尤其是通过生成长推理轨迹。然而,这些扩展输出常存在显著冗余,限制了LRMs的效率。本文研究基于RL的提升推理效率方法。具体而言,我们首先提出一个统一框架,通过基于长度的奖励塑形视角形式化多种高效推理方法。基于此视角,我们提出一种新颖的基于长度的阶梯奖励塑形方法(LASER),该方法采用由目标长度控制的阶梯函数作为奖励。LASER超越了先前方法,在性能与效率间实现了更优的帕累托最优平衡。进一步地,我们基于两个关键直觉扩展LASER:(1)模型推理行为在训练过程中动态演变,需要奖励机制具备自适应与动态特性;(2)与其统一鼓励短或长思维链(CoT),我们认为基于长度的奖励塑形应具备难度感知能力,即对简单查询应更严厉地惩罚冗长CoT。该方法有望促进快慢思维的结合,实现更好的整体权衡。改进后的方法称为LASER-D(动态与难度感知型)。在DeepSeek-R1-Distill-Qwen-1.5B、DeepSeek-R1-Distill-Qwen-7B和DeepSeek-R1-Distill-Qwen-32B上的实验表明,我们的方法显著提升了推理性能和响应长度效率。例如,LASER-D及其变体在AIME2024上实现+6.1分提升的同时减少63%的token使用量。进一步分析表明,基于RL的压缩产生了更简洁的推理模式,冗余的"自我反思"更少。资源见https://github.com/hkust-nlp/Laser。
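
A step-function length reward of the kind described above can be sketched in a few lines. The exact reward used by LASER is not stated in the abstract, so the form below (a correctness reward plus a fixed bonus when a correct answer stays within the target length) and the names `step_length_reward` and `bonus` are assumptions for illustration only.

```python
def step_length_reward(is_correct, length, target_length, bonus=0.5):
    """Step-shaped length reward (illustrative form, not the paper's).

    - Incorrect answers get 0, regardless of length.
    - Correct answers get 1.
    - Correct answers no longer than target_length get an extra bonus,
      which is the "step" controlled by the target length.
    """
    if not is_correct:
        return 0.0
    reward = 1.0
    if length <= target_length:
        reward += bonus
    return reward


short_correct = step_length_reward(True, length=100, target_length=200)
long_correct = step_length_reward(True, length=300, target_length=200)
short_wrong = step_length_reward(False, length=100, target_length=200)
```

Under this assumed form, a difficulty-aware variant in the spirit of LASER-D would pass a smaller `target_length` for easy queries, so long chains of thought lose the bonus sooner on them.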


A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability

Abstract

arXiv:2505.15683v1 Announce Type: cross Abstract: Private data is typically larger and of higher quality than public data, offering great potential to improve LLMs. However, its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based split learning model has emerged, offloading most model parameters to the server while retaining only the embedding and output layers on clients to ensure privacy. However, it still faces significant challenges in security, efficiency, and adaptability: 1) embedding gradients are vulnerable to attacks, leading to reverse engineering of private data; 2) the autoregressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FL-LLaMA, a secure, efficient, and adaptive federated split framework based on LLaMA2. First, we place some input and output blocks on the local client and inject Gaussian noise into forward-pass hidden states, enabling secure end-to-end propagation. Second, we employ client-batch and server-hierarchical strategies to achieve parallel training, along with attention-mask compression and KV cache mechanisms to accelerate inference, reducing communication costs effectively. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements and hardware limitations. Experiments on NLU, summarization and conversational QA tasks show that FL-LLaMA maintains performance comparable to centralized LLaMA2, and achieves up to 2x training speedups and 8x inference speedups. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FL-LLaMA in security and adaptability.

摘要

私有数据通常比公共数据规模更大、质量更高,为提升大语言模型(LLM)性能提供了巨大潜力。然而,这些数据分散存储于各数据孤岛,加之LLM的高计算需求,限制了其在联邦环境中的部署。为此,基于Transformer的拆分学习模型应运而生,它将大部分模型参数卸载至服务器,仅保留嵌入层和输出层在客户端以确保隐私。但该方案仍面临安全性、效率与适应性三大挑战:1)嵌入梯度易受攻击,可能导致私有数据被逆向工程;2)LLM的自回归特性使得联邦拆分学习只能串行训练推理,产生高通信开销;3)固定分割点缺乏对下游任务的适应性。本文提出FL-LLaMA,一个基于LLaMA2的安全、高效、自适应的联邦拆分框架。首先,我们在本地客户端部署部分输入输出模块,并对前向传播的隐藏状态注入高斯噪声,实现安全的端到端传播。其次,采用客户端批处理与服务器分层策略实现并行训练,结合注意力掩码压缩和KV缓存机制加速推理,有效降低通信成本。第三,允许用户根据任务需求与硬件限制动态调整输入/输出模块的分割点。在自然语言理解、文本摘要和对话问答任务上的实验表明,FL-LLaMA保持了与集中式LLaMA2相当的性能,训练速度提升达2倍,推理速度提升达8倍。针对隐私攻击和不同分割点的进一步分析也验证了FL-LLaMA在安全性与适应性方面的有效性。
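
The noise-injection step described above can be sketched as follows. This is a minimal illustration using plain Python lists in place of tensors; the function name `add_gaussian_noise`, the `sigma` parameter, and the choice of independent zero-mean noise per element are assumptions, since the abstract does not specify how FL-LLaMA calibrates or applies the noise.

```python
import random


def add_gaussian_noise(hidden_state, sigma, rng=None):
    """Return a copy of hidden_state with zero-mean Gaussian noise added.

    hidden_state: list of floats standing in for one hidden vector that the
    client would send to the server.
    sigma: noise standard deviation (hypothetical knob; larger values hide
    more but distort the representation more).
    rng: optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    return [h + rng.gauss(0.0, sigma) for h in hidden_state]


hidden = [0.5, -1.0, 2.0]
noisy = add_gaussian_noise(hidden, sigma=0.1, rng=random.Random(0))
unchanged = add_gaussian_noise(hidden, sigma=0.0, rng=random.Random(0))
```

Seeding the generator makes the perturbation reproducible for testing; in a real deployment the client would keep the noise unpredictable to the server.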


Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Abstract

arXiv:2505.15753v1 Announce Type: cross Abstract: Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.

摘要

大型语言模型(LLMs)已知易受越狱攻击的影响,攻击者通过精心设计的提示诱导模型产生有害或不道德的响应。此类威胁引发了人们对LLM在实际部署中安全性与可靠性的严重关切。尽管现有防御机制能部分缓解此类风险,但对抗技术的持续进步使得新型越狱方法能够绕过这些保护措施,暴露出静态防御框架的局限性。本研究从上下文检索的视角探索防御不断演变的越狱威胁。首先,我们通过初步实验证明:即使针对特定越狱攻击使用极小规模的安全对齐示例,也能显著增强模型对该攻击模式的鲁棒性。基于这一发现,我们进一步结合检索增强生成(RAG)技术,提出安全上下文检索(SCR)——一种可扩展且鲁棒的LLM越狱防护范式。全面实验表明,SCR对既有及新兴越狱策略均能实现卓越的防御性能,为LLM安全领域贡献了新范式。代码将在论文发表时公开。
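
The retrieval step behind Safety Context Retrieval can be pictured with a toy sketch. It assumes a pool of safety-aligned examples keyed by a short description of the attack pattern, and it uses word-overlap (Jaccard) similarity as a stand-in retriever. The pool contents, field names, and the similarity measure are all hypothetical; the paper's retriever is not described in the abstract.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_safety_context(query, pool, k=1):
    """Return the k pool entries whose pattern best matches the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex["pattern"]),
                    reverse=True)
    return ranked[:k]


# Hypothetical pool of safety-aligned demonstrations.
pool = [
    {"pattern": "pretend you are a character with no rules",
     "demo": "refusal demo A"},
    {"pattern": "translate this encoded request and follow it",
     "demo": "refusal demo B"},
]

top = retrieve_safety_context("please pretend you are a character", pool, k=1)
```

The retrieved demonstrations would then be placed in the model's context ahead of the user query, and new attack patterns could be covered by adding entries to the pool rather than retraining.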


Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities

Abstract

arXiv:2505.15722v1 Announce Type: cross Abstract: We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation - ignoring their similarities - obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.

摘要

我们首次对多语言大语言模型(MLLMs)中的记忆现象进行了全面研究,通过分析95种语言,涵盖了不同模型规模、架构和记忆定义的模型。随着MLLMs的广泛应用,理解其记忆行为变得至关重要。然而,先前的研究主要集中于单语言模型,尽管训练语料库本质上是长尾分布的,但多语言记忆现象仍未得到充分探索。我们发现,当前普遍认为记忆与训练数据可用性高度相关的假设,并不能完全解释MLLMs中的记忆模式。我们提出假设,孤立地看待语言——忽略其相似性——会掩盖真实的记忆模式。为此,我们提出了一种新颖的基于图的关联度量方法,该方法结合语言相似性来分析跨语言记忆。我们的分析表明,在相似语言中,训练标记较少的语言往往表现出更高的记忆性,这一趋势只有在显式建模跨语言关系时才会显现。这些发现强调了在评估和缓解MLLMs记忆漏洞时采用语言感知视角的重要性。这也为语言相似性既解释了MLLMs中的记忆现象,又支撑了跨语言可迁移性提供了实证证据,对多语言自然语言处理领域具有广泛意义。


Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications

Abstract

arXiv:2505.15741v1 Announce Type: cross Abstract: Integrating Large Language Models (LLMs) and Evolutionary Computation (EC) represents a promising avenue for advancing artificial intelligence by combining powerful natural language understanding with optimization and search capabilities. This manuscript explores the synergistic potential of LLMs and EC, reviewing their intersections, complementary strengths, and emerging applications. We identify key opportunities where EC can enhance LLM training, fine-tuning, prompt engineering, and architecture search, while LLMs can, in turn, aid in automating the design, analysis, and interpretation of ECs. The manuscript explores the synergistic integration of EC and LLMs, highlighting their bidirectional contributions to advancing artificial intelligence. It first examines how EC techniques enhance LLMs by optimizing key components such as prompt engineering, hyperparameter tuning, and architecture search, demonstrating how evolutionary methods automate and refine these processes. Secondly, the survey investigates how LLMs improve EC by automating metaheuristic design, tuning evolutionary algorithms, and generating adaptive heuristics, thereby increasing efficiency and scalability. Emerging co-evolutionary frameworks are discussed, showcasing applications across diverse fields while acknowledging challenges like computational costs, interpretability, and algorithmic convergence. The survey concludes by identifying open research questions and advocating for hybrid approaches that combine the strengths of EC and LLMs.

摘要

整合大型语言模型(LLMs)与进化计算(EC)为人工智能发展提供了前景广阔的路径,通过将强大的自然语言理解能力与优化搜索技术相结合。本文探讨了LLMs与EC的协同潜力,综述了二者的交叉领域、互补优势及新兴应用。我们指出EC可优化LLM训练、微调、提示工程和架构搜索的关键机遇,同时LLMs也能助力EC的自动化设计、分析与解释。研究深入分析了EC与LLMs的协同融合,着重阐释二者对推动人工智能发展的双向贡献:首先论证EC技术如何通过优化提示工程、超参数调优和架构搜索等关键组件来增强LLMs,展示进化方法如何实现这些流程的自动化与精调;其次探讨LLMs如何通过自动化元启发式设计、调优进化算法及生成自适应启发式来提升EC的效率和可扩展性。文中讨论了新兴的协同进化框架在跨领域中的应用实例,同时指出计算成本、可解释性和算法收敛性等挑战。最后提出开放研究问题,倡导结合EC与LLMs优势的混合研究方法。


UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models

Abstract

arXiv:2505.15674v1 Announce Type: cross Abstract: Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model's intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce the specified forgetting objective. Serving as a new research direction for token learning to induce unlearning targets, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, on the TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms the previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses the previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performance in the current unlearning domain.

摘要

大型语言模型需要通过迭代更新来解决知识冲突和过时信息(如错误、隐私或非法内容)等挑战。机器遗忘为从训练模型中定向移除知识提供了系统性方法,可消除敏感信息的影响。然而,基于微调的主流遗忘方法往往难以平衡遗忘效能与模型能力,在大量知识移除时频繁导致灾难性模型崩溃。而仅依赖上下文提示、不修改模型内在机制的上下文遗忘方法,存在泛化性有限的问题,难以实现真正的遗忘。本研究提出UniErase这一新型遗忘范式,它采用可学习的参数化后缀(遗忘标记)来引导语言模型实现定向遗忘行为。UniErase通过两个关键阶段运作:(I)通过标记优化将期望的遗忘输出绑定至模型自回归概率分布的优化阶段;(II)激活已学习标记以概率化诱导指定遗忘目标的轻量级模型编辑阶段。作为通过标记学习实现遗忘目标的新研究方向,UniErase在虚构和真实世界知识设定下的批量、序列及精确遗忘任务中均达到最先进(SOTA)性能。值得注意的是,在TOFU基准测试中,UniErase仅修改约3.66%的大语言模型参数,其模型能力表现超越先前遗忘SOTA基线约4.01倍,同时具备更优的遗忘效能。同样地,UniErase在保持更强能力的同时,其遗忘效能较先前保留SOTA方法提升35.96%,在当前遗忘领域展现出双重顶尖性能。


Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Abstract

arXiv:2505.15738v1 Announce Type: cross Abstract: Large language models (LLMs) are rapidly deployed in real-world applications ranging from chatbots to agentic systems. Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks. Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG), a white-box attack that generates adversarial suffixes to induce attacker-desired outputs. However, this search space over discrete tokens is extremely large, making the task of finding successful attacks difficult. GCG has, for instance, been shown to converge to local minima, making it sensitive to initialization choices. In this paper, we assess the future-proof robustness of these defenses using a more informed threat model: attackers who have access to some information about the alignment process. Specifically, we propose an informed white-box attack leveraging the intermediate model checkpoints to initialize GCG, with each checkpoint acting as a stepping stone for the next one. We show this approach to be highly effective across state-of-the-art (SOTA) defenses and models. We further show our informed initialization to outperform other initialization methods and show a gradient-informed checkpoint selection strategy to greatly improve attack performance and efficiency. Importantly, we also show our method to successfully find universal adversarial suffixes -- single suffixes effective across diverse inputs. Our results show that, contrary to previous beliefs, effective adversarial suffixes do exist against SOTA alignment-based defenses, that these can be found by existing attack methods when adversaries exploit alignment knowledge, and that even universal suffixes exist. Taken together, our results highlight the brittleness of current alignment-based methods and the need to consider stronger threat models when testing the safety of LLMs.

摘要

大型语言模型(LLMs)正快速应用于从聊天机器人到代理系统的现实场景中。对齐技术是抵御提示注入和越狱等攻击的主要方法之一。近期防御方案显示,即便面对贪婪坐标梯度(GCG)这类生成对抗性后缀以诱导攻击者预期输出的白盒攻击,其攻击成功率(ASR)也接近零。然而,离散令牌的搜索空间极其庞大,使得寻找有效攻击变得困难。例如,GCG已被证明会收敛至局部极小值,因而对初始化选择极为敏感。本文通过更具信息量的威胁模型评估这些防御的未来稳健性:假设攻击者能获取对齐过程的某些信息。具体而言,我们提出一种利用中间模型检查点初始化GCG的知情白盒攻击方法,每个检查点作为后续攻击的跳板。实验证明该方法在当前最先进(SOTA)防御和模型中均高度有效。我们进一步表明,这种知情初始化优于其他初始化方法,并展示基于梯度的检查点选择策略可显著提升攻击性能和效率。值得注意的是,该方法还能成功发现通用对抗性后缀——即对多样化输入均有效的单一后缀。研究结果表明:与既往认知相反,针对SOTA基于对齐的防御确实存在有效对抗性后缀;当攻击者利用对齐知识时,现有攻击方法即可发现这些后缀;甚至存在通用后缀。这些发现共同揭示了当前基于对齐方法的脆弱性,并强调测试LLM安全性时需考虑更强威胁模型的必要性。


Multi-modal Integration Analysis of Alzheimer's Disease Using Large Language Models and Knowledge Graphs

Abstract

arXiv:2505.15747v1 Announce Type: cross Abstract: We propose a novel framework for integrating fragmented multi-modal data in Alzheimer's disease (AD) research using large language models (LLMs) and knowledge graphs. While traditional multimodal analysis requires matched patient IDs across datasets, our approach demonstrates population-level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts. Statistical analysis identified significant features in each modality, which were connected as nodes in a knowledge graph. LLMs then analyzed the graph to extract potential correlations and generate hypotheses in natural language. This approach revealed several novel relationships, including a potential pathway linking metabolic risk factors to tau protein abnormalities via neuroinflammation (r>0.6, p<0.001), and unexpected correlations between frontal EEG channels and specific gene expression profiles (r=0.42-0.58, p<0.01). Cross-validation with independent datasets confirmed the robustness of major findings, with consistent effect sizes across cohorts (variance <15%). The reproducibility of these findings was further supported by expert review (Cohen's κ=0.82) and computational validation. Our framework enables cross-modal integration at a conceptual level without requiring patient ID matching, offering new possibilities for understanding AD pathology through fragmented data reuse and generating testable hypotheses for future research.

摘要

我们提出了一种新颖框架,利用大语言模型(LLMs)和知识图谱整合阿尔茨海默病(AD)研究中的碎片化多模态数据。传统多模态分析要求跨数据集的患者ID匹配,而我们的方法实现了来自独立队列的MRI、基因表达、生物标志物、脑电图和临床指标在群体层面的整合。统计分析识别出各模态的显著特征,并将其作为节点连接至知识图谱。随后通过LLMs分析图谱以提取潜在关联并生成自然语言假设。该方法揭示了若干新关系,包括通过神经炎症将代谢风险因素与tau蛋白异常相联系的潜在通路(r>0.6,p<0.001),以及额叶脑电通道与特定基因表达谱之间的意外相关性(r=0.42-0.58,p<0.01)。独立数据集的交叉验证证实了主要发现的稳健性,各队列效应量保持一致(方差<15%)。这些发现的可重复性进一步得到专家评审(Cohen's κ=0.82)和计算验证的支持。本框架实现了概念层面的跨模态整合,无需患者ID匹配,为通过碎片化数据重用理解AD病理机制提供了新途径,并为未来研究生成可检验假设。


HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement

Abstract

arXiv:2505.15740v1 Announce Type: cross Abstract: Formal methods is pivotal for verifying the reliability of critical systems through rigorous mathematical proofs. However, its adoption is hindered by labor-intensive manual proofs and the expertise required to use theorem provers. Recent advancements in large language models (LLMs) offer new opportunities for automated theorem proving. Two promising approaches are generating tactics step by step and generating a whole proof directly with an LLM. However, existing work makes no attempt to combine the two approaches. In this work, we introduce HybridProver, a dual-model proof synthesis framework that combines tactic-based generation and whole-proof synthesis to harness the benefits of both approaches. HybridProver generates whole proof candidates for evaluation directly, then extracts proof sketches from those candidates. It then uses a tactic-based generation model that integrates automated tools to complete the sketches via stepwise refinement. We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized Isabelle datasets. Evaluation on the miniF2F dataset illustrates HybridProver's effectiveness. We achieve a 59.4% success rate on miniF2F, where the previous SOTA is 56.1%. Our ablation studies show that this SOTA result is attributable to combining whole-proof and tactic-based generation. Additionally, we show how the dataset quality, training parameters, and sampling diversity affect the final result during automated theorem proving with LLMs. All of our code, datasets, and LLMs are open source.

摘要

形式化方法通过严格的数学证明对关键系统进行可靠性验证具有关键作用。然而,其应用受到劳动密集型的手动证明过程以及使用定理证明器所需专业知识的阻碍。大型语言模型(LLMs)的最新进展为自动定理证明提供了新的机遇。目前两种主流方法分别是逐步生成策略和直接利用LLM生成完整证明。然而现有研究尚未尝试将这两种方法相结合。本研究提出HybridProver——一个融合策略生成与整体证明合成的双模型证明合成框架,以兼收两种方法的优势。该框架首先生成完整证明候选方案进行评估,随后从中提取证明草图,再通过集成自动化工具的策略生成模型进行逐步精化以完成证明。我们针对Isabelle定理证明器实现了HybridProver,并在优化的Isabelle数据集上对LLMs进行微调。在miniF2F数据集上的评估表明该框架的有效性,其成功率达到59.4%(此前最高水平为56.1%)。消融实验证实这一最优结果源于整体证明与策略生成方法的结合。此外,我们揭示了在LLM自动定理证明过程中,数据集质量、训练参数和采样多样性对最终结果的影响机制。所有代码、数据集及LLMs均已开源。


On the Evolution of Knowledge Graphs: A Survey and Perspective

Abstract

arXiv:2310.04835v3 Announce Type: replace Abstract: Knowledge graphs (KGs) are structured representations of diversified knowledge. They are widely used in various intelligent applications. In this article, we provide a comprehensive survey on the evolution of various types of knowledge graphs (i.e., static KGs, dynamic KGs, temporal KGs, and event KGs) and techniques for knowledge extraction and reasoning. Furthermore, we introduce the practical applications of different types of KGs, including a case study in financial analysis. Finally, we propose our perspective on the future directions of knowledge engineering, including the potential of combining the power of knowledge graphs and large language models (LLMs), and the evolution of knowledge extraction, reasoning, and representation.

摘要

知识图谱(KGs)是多样化知识的结构化表示形式,广泛应用于各类智能应用中。本文全面综述了各类知识图谱(包括静态知识图谱、动态知识图谱、时序知识图谱和事件知识图谱)的演进历程,以及知识抽取与推理技术。此外,我们介绍了不同类型知识图谱的实际应用,包括金融分析中的案例研究。最后,我们展望了知识工程未来的发展方向,包括知识图谱与大型语言模型(LLMs)结合的潜力,以及知识抽取、推理与表示技术的演进趋势。


VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Abstract

arXiv:2505.15801v1 Announce Type: cross Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.

摘要

OpenAI o1和DeepSeek-R1等大型推理模型在推理领域取得了显著性能。其训练的关键组成部分是在强化学习(RL)中引入可验证的奖励机制。然而,现有的奖励基准测试并未评估基于参考的奖励系统,导致研究人员对RL中使用的验证器准确性理解有限。本文提出两个基准测试VerifyBench和VerifyBench-Hard,旨在评估基于参考的奖励系统性能。这些基准通过细致的数据收集与整理构建,并经过人工标注以确保高质量。当前模型在VerifyBench和VerifyBench-Hard上仍显示出较大改进空间,尤其是小规模模型。此外,我们对评估结果进行了全面深入的分析,为理解和开发基于参考的奖励系统提供了见解。所提出的基准测试可作为有效工具,指导验证器准确性提升以及通过RL训练的模型在推理任务中的能力发展。


Large Language Models as Computable Approximations to Solomonoff Induction

Abstract

arXiv:2505.15784v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates the Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.

摘要

大型语言模型(LLMs)的快速发展亟需建立严谨的理论框架以解释其经验性成功。尽管在理解LLM行为方面已取得显著进展,现有理论框架仍缺乏通过统一数学视角解释涌现现象的完整性。我们首次构建了LLM架构与算法信息论(AIT)之间的形式化关联,通过证明两个核心结论:(1)训练过程通过可解释为程序长度优化的损失最小化,实现了对Solomonoff先验的计算近似;(2)下一词预测执行了近似Solomonoff归纳。基于AIT,我们为上下文学习、小样本学习及缩放定律提供了统一的理论解释。进一步地,这些理论洞见催生了一种原则性的小样本示例选择方法,该方法优先选取模型预测置信度较低的样本。通过在多样化文本分类基准上的实验验证,相较于选择高置信度样本,该策略(尤其对于较小架构模型)能带来显著的性能提升。本框架弥合了理论基础与实际LLM行为之间的鸿沟,既具备解释力又为未来模型开发提供了可操作的见解。
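
The selection rule mentioned in the abstract (prefer few-shot candidates on which the model is less confident) can be sketched roughly as follows. This is a minimal illustration only: the function name `select_low_confidence`, the candidate format, and the use of the largest class probability as the "confidence" score are all assumptions made here, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def select_low_confidence(
    candidates: Sequence[Tuple[str, Sequence[float]]], k: int
) -> List[str]:
    """Return the k candidate texts with the lowest model confidence.

    Each candidate is (text, class_probabilities). Confidence is assumed to be
    the largest predicted class probability (a choice made for this sketch).
    """
    ranked = sorted(candidates, key=lambda item: max(item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical candidate pool with made-up probabilities.
pool = [
    ("great movie", [0.95, 0.05]),
    ("not bad, not good", [0.55, 0.45]),
    ("somewhat dull", [0.30, 0.70]),
]
chosen = select_low_confidence(pool, k=2)
```

Sorting by confidence keeps the rule independent of any particular model; other uncertainty scores (for example, predictive entropy) could be swapped into the `key` function.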


Long-Form Information Alignment Evaluation Beyond Atomic Facts

Abstract

arXiv:2505.15792v1 Announce Type: cross Abstract: Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.

摘要

信息对齐评估器对于各类自然语言生成评估任务及可信赖大语言模型部署至关重要,可减少幻觉并增强用户信任。当前细粒度方法(如FactScore)虽能独立验证事实,却忽略了事实间依赖关系,导致存在微妙漏洞。本研究提出MontageLie基准测试,通过"蒙太奇"式拼接真实陈述(不引入显式幻觉)构建具有欺骗性的叙述,其挑战性在于:实验表明基于大语言模型的粗粒度评估器和现有细粒度框架均易受此攻击,AUC-ROC分数均低于65%。为实现更鲁棒的细粒度评估,我们提出DoveScore框架,其创新性在于联合验证事实准确性与事件顺序一致性。通过建模事实间关联关系,DoveScore以超过8%的优势优于现有细粒度方法,为长文本对齐评估提供了更稳健的解决方案。代码及数据集详见https://github.com/dannalily/DoveScore。


"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Abstract

arXiv:2412.19755v3 Announce Type: replace Abstract: Assessments play a vital role in a student's learning process. This is because they provide valuable feedback crucial to a student's growth. Such assessments contain questions with open-ended responses, which are difficult to grade at scale. These responses often require students to express their understanding through textual and visual elements together as a unit. In order to develop scalable assessment tools for such questions, one needs multimodal LLMs having strong comparative reasoning capabilities across multiple modalities. Thus, to facilitate research in this area, we propose the Multimodal Short Answer grading with Feedback (MMSAF) problem along with a dataset of 2,197 data points. Additionally, we provide an automated framework for generating such datasets. As per our evaluations, existing Multimodal Large Language Models (MLLMs) could predict whether an answer is correct, incorrect or partially correct with an accuracy of 55%. Similarly, they could predict whether the image provided in the student's answer is relevant or not with an accuracy of 75%. As per human experts, Pixtral was more aligned towards human judgement and values for biology and ChatGPT for physics and chemistry and achieved a score of 4 or more out of 5 in most parameters.


BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems

Abstract

arXiv:2401.17644v4 Announce Type: replace Abstract: Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-source LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 10.31 million traces from regional Azure OpenAI GPT services over 213 days. BurstGPT captures LLM serving characteristics from user, model and system perspectives: (1) User request concurrency: burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns in different services and model types. (2) User conversation patterns: counts and intervals within conversations for service optimizations. (3) Model response lengths: auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (4) System response failures: failures of conversation and API services, showing intensive resource needs and limited availability of LLM services in Azure. The details of the characteristics can serve multiple purposes in LLM serving optimizations, such as system evaluation and trace provisioning. In our demo evaluation with BurstGPT, frequent variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We identify that the generalization of KV cache management, scheduling and disaggregation optimizations can be improved under realistic workload evaluations. BurstGPT is publicly available now at https://github.com/HPMLL/BurstGPT and is widely used to develop prototypes of LLM serving frameworks in the industry.

摘要

大型语言模型(LLM)的服务系统通常以提升服务质量(QoS)和吞吐量为优化目标。然而,由于缺乏开源的LLM服务负载数据,这些系统常在非真实的负载假设下进行评估,导致实际部署时性能下降。本研究提出BurstGPT——一个包含213天内区域Azure OpenAI GPT服务1031万条追踪记录的LLM服务负载数据集。BurstGPT从用户、模型和系统三个维度捕捉LLM服务特征:(1)用户请求并发性:揭示Azure OpenAI GPT服务中请求突发性变化,展现不同服务与模型类型的多样化并发模式;(2)用户会话模式:统计会话内请求次数与间隔,为服务优化提供依据;(3)模型响应长度:呈现GPT模型自回归服务过程中的请求-响应统计关系;(4)系统响应故障:分析会话与API服务失败案例,反映Azure中LLM服务的高资源需求与有限可用性。这些特征细节可支持LLM服务优化的多场景应用,如系统评估与追踪配置。基于BurstGPT的演示评估表明,真实LLM服务中频繁的负载波动会导致效率、稳定性或可靠性下降。我们发现KV缓存管理通用性、调度优化与解耦设计在真实负载评估中存在改进空间。BurstGPT已开源发布于https://github.com/HPMLL/BurstGPT,目前正广泛应用于工业界LLM服务框架的原型开发。


Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning

Abstract

arXiv:2502.17216v2 Announce Type: replace Abstract: Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, it remains unclear what the contributing factors to the success of Neurosymbolic LLM reasoning are. This paper shows that one important factor is the choice of the formal language. By comparing 4 formal languages on 3 datasets over 6 LLMs, we show that the choice of formal language affects both the syntactic and the semantic reasoning capability. Thereby, we introduce the intermediate language challenge, which is the challenge of picking a suitable formal language for neurosymbolic reasoning. Further, we compare the effects of using different in-context-learning examples in an ablation study. We conclude that on average, context-aware encodings help LLMs to reason, while there is no apparent effect of using comments or markdown syntax.

摘要

大型语言模型(LLMs)在广泛任务中取得了惊人成果,但其形式化推理能力仍有不足。神经符号化LLM推理是一种前景广阔的方法,其通过将LLMs作为自然语言到形式化语言的翻译器,并利用符号求解器推导正确结果。然而,该方法的成功关键因素尚不明确。本文研究表明,形式化语言的选择是重要因素之一。通过在3个数据集上对比6种LLMs对4种形式化语言的表现,我们发现形式化语言的选择同时影响句法和语义推理能力。由此提出"中间语言挑战",即如何为神经符号化推理选择合适的形式化语言。此外,通过消融实验比较了不同上下文学习样本的影响,结果表明:平均而言,上下文感知编码有助于LLMs推理,而使用注释或标记语法则无明显效果。


NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

Abstract

arXiv:2409.03797v3 Announce Type: replace Abstract: The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs' fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on a variety of models show that the best-performing model (GPT-4o) achieves a full sequence match accuracy of 28% and a win-rate of 60%, indicating substantial room for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress. We have released the NESTFUL dataset under the Apache 2.0 license at https://github.com/IBM/NESTFUL.

摘要

基于大语言模型(LLMs)构建的自主智能体在解决复杂现实任务中的复兴,使人们更加关注LLMs工具或函数调用的基础能力。这些智能体的核心在于LLM必须通过外部工具、API和自定义函数进行规划、执行和响应。尽管工具调用的研究势头正盛,但能体现任务复杂性的评估基准和数据集却相对滞后。本研究聚焦于嵌套序列这一复杂性维度,旨在扩展现有基准与评估体系。具体而言,我们提出了NESTFUL基准,用于评估LLMs在嵌套API调用序列(即前一个API调用的输出作为后续调用输入的序列)上的表现。NESTFUL包含1800多个可执行函数调用的嵌套序列。多模型实验结果表明,性能最佳的模型(GPT-4o)仅达到28%的全序列匹配准确率和60%的胜率,表明函数调用的嵌套序列处理能力仍有巨大改进空间。除提供可追踪进展的基准外,我们对结果的分析也为学界指明了可能的未来研究方向。NESTFUL数据集已按Apache 2.0协议发布于https://github.com/IBM/NESTFUL。
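
To make "nested sequences of API calls" concrete, the sketch below executes a short sequence in which a later call consumes an earlier call's output. The tool registry, the `label` field, and the `$label` reference syntax are hypothetical conventions chosen for this example; they are not claimed to match the NESTFUL data format.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool registry; these functions are stand-ins, not NESTFUL APIs.
TOOLS: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def run_nested_sequence(calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Execute calls in order, resolving "$label" arguments to earlier outputs."""
    outputs: Dict[str, Any] = {}
    for call in calls:
        resolved = {
            name: (
                outputs[value[1:]]
                if isinstance(value, str) and value.startswith("$")
                else value
            )
            for name, value in call["arguments"].items()
        }
        outputs[call["label"]] = TOOLS[call["name"]](**resolved)
    return outputs


sequence = [
    {"name": "add", "label": "var1", "arguments": {"a": 2, "b": 3}},
    # The second call takes the first call's output as its input.
    {"name": "multiply", "label": "var2", "arguments": {"a": "$var1", "b": 4}},
]
results = run_nested_sequence(sequence)
```

A model evaluated on such a task has to get both the call order and the references right, which is one plausible reason full-sequence match is much harder than matching individual calls.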


Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

Abstract

arXiv:2412.00342v2 Announce Type: replace Abstract: In today's digital age, video content is prevalent, serving as a primary source of information, education, and entertainment. However, the Deaf and Hard of Hearing (DHH) community often faces significant challenges in accessing video content due to the inadequacy of automatic speech recognition (ASR) systems in providing accurate and reliable captions. This paper addresses the urgent need to improve video caption quality by leveraging Large Language Models (LLMs). We present a comprehensive study that explores the integration of LLMs to enhance the accuracy and context-awareness of captions generated by ASR systems. Our methodology involves a novel pipeline that corrects ASR-generated captions using advanced LLMs. It focuses specifically on models like GPT-3.5 and Llama2-13B due to their robust performance in language comprehension and generation tasks. We introduce a dataset representative of real-world challenges the DHH community faces to evaluate our proposed pipeline. Our results indicate that LLM-enhanced captions significantly improve accuracy, as evidenced by a notably lower Word Error Rate (WER) achieved by ChatGPT-3.5 (WER: 9.75%) compared to the original ASR captions (WER: 23.07%). This corresponds to an approximate 57.72% relative improvement in WER.

摘要

在当今数字时代,视频内容作为信息、教育和娱乐的主要载体已无处不在。然而,由于自动语音识别(ASR)系统生成的字幕在准确性和可靠性方面存在不足,聋人及听力障碍(DHH)群体在获取视频内容时常面临重大挑战。本文针对利用大语言模型(LLM)提升视频字幕质量的迫切需求展开研究,提出了一种通过整合LLM来增强ASR生成字幕的准确性与语境感知能力的综合方案。我们设计了一种创新流程,采用GPT-3.5和Llama2-13B等具有卓越语言理解与生成能力的先进LLM来修正ASR生成的字幕。为评估该流程效果,我们构建了反映DHH群体现实困境的数据集。实验结果表明:经LLM增强的字幕准确性显著提升,ChatGPT-3.5实现的词错误率(WER: 9.75%)较原始ASR字幕(WER: 23.07%)降低约57.72%。
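
For readers unfamiliar with the metric, the sketch below shows one common way to compute Word Error Rate (word-level edit distance divided by the reference length) and the relative reduction between two WER values. The function names and the example sentences are made up for illustration; the paper's own evaluation script may normalize text differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the distance between
    # ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# WER values quoted in the abstract, in percent.
reduction = relative_reduction(23.07, 9.75)
```

With the two WER figures from the abstract, `reduction` comes out a little under 0.58, in line with the relative improvement the authors report.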


ZEBRA: Leveraging Model-Behavioral Knowledge for Zero-Annotation Preference Dataset Construction

Abstract

arXiv:2502.18744v2 Announce Type: replace Abstract: Recent efforts in LLM alignment have focused on constructing large-scale preference datasets via human or Artificial Intelligence (AI) annotators. However, such approaches rely on instance-wise supervision, incurring substantial annotation cost and limited interpretability. In this paper, we propose ZEBRA, a model behavior-wise zero-annotation framework that constructs preference data by leveraging model behavior knowledge derived from benchmark performances. ZEBRA binarizes response pairs by evaluating the quality and similarity of their origin models, entirely bypassing instance-level annotation. This allows scalable, controllable, and cost-effective alignment data generation. Empirical results show that ZEBRA achieves alignment performance comparable to instance-supervised methods, despite requiring no manual or model-based labeling.

摘要

近期大语言模型对齐研究主要集中于通过人工或人工智能标注者构建大规模偏好数据集。然而,这种方法依赖实例级监督,存在标注成本高昂且可解释性有限的问题。本文提出ZEBRA框架——一种基于模型行为知识的零标注方法,通过利用基准测试表现衍生的模型行为知识构建偏好数据。ZEBRA通过评估原始模型的质量和相似性对响应进行二值化处理,完全绕过实例级标注,实现了可扩展、可控且低成本的对齐数据生成。实验结果表明,尽管无需人工或模型标注,ZEBRA仍能取得与实例监督方法相当的对齐性能。


SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL

Abstract

arXiv:2502.11741v2 Announce Type: replace Abstract: Text-to-SQL (Text2SQL) aims to map natural language questions to executable SQL queries. Although large language models (LLMs) have driven significant progress, current approaches struggle with poor transferability to open-source LLMs, limited robustness against logic and function errors in complex queries, and inefficiencies in structured search. We introduce SQL-o1, a self-reward-driven heuristic search framework built on an agent-based architecture to enhance model reasoning capabilities. SQL-o1 leverages Monte Carlo Tree Search (MCTS) for structured, multi-step exploration, and incorporates a dynamic pruning strategy to accelerate inference without sacrificing accuracy. On the Spider and Bird benchmarks, SQL-o1 achieves a +10.8 execution accuracy improvement on the complex Bird dataset, surpassing even GPT-4-based models. Notably, it exhibits strong few-shot generalization and robust cross-model transferability across open-source LLMs. Our code is available at https://github.com/ShuaiLyu0110/SQL-o1.

摘要

文本到SQL(Text2SQL)旨在将自然语言问题映射为可执行的SQL查询。尽管大语言模型(LLM)推动了显著进展,但现有方法仍面临以下问题:对开源LLM的迁移性较差、针对复杂查询中逻辑和功能错误的鲁棒性有限,以及结构化搜索效率低下。我们提出了SQL-o1,这是一个基于智能体架构的自奖励驱动启发式搜索框架,旨在增强模型推理能力。SQL-o1利用蒙特卡洛树搜索(MCTS)进行结构化多步探索,并采用动态剪枝策略以在不牺牲准确性的前提下加速推理。在Spider和Bird基准测试中,SQL-o1在复杂Bird数据集上实现了+10.8%的执行准确率提升,甚至超越了基于GPT-4的模型。值得注意的是,该方法在开源LLM间展现出强大的少样本泛化能力和稳健的跨模型迁移性。我们的代码发布于:https://github.com/ShuaiLyu0110/SQL-o1。
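
Execution accuracy, the metric quoted above, scores a predicted query by running it and comparing its result with that of the gold query. The following is a simplified sketch using Python's built-in `sqlite3` module; the table, the queries, and the order-insensitive row comparison are assumptions for illustration, and the official Spider and Bird evaluators apply their own comparison rules.

```python
import sqlite3
from collections import Counter
from typing import List, Tuple


def same_execution_result(conn: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """True if both queries return the same multiset of rows (order ignored)."""
    try:
        predicted_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose execution results match."""
    matches = sum(same_execution_result(conn, pred, gold) for pred, gold in pairs)
    return matches / len(pairs)


# Toy in-memory database with made-up contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 41)])

pairs = [
    # Different SQL text, same rows on this database: counted as correct.
    ("SELECT name FROM singer WHERE age > 35", "SELECT name FROM singer WHERE age >= 40"),
    # Misspelled column: execution fails, counted as incorrect.
    ("SELECT nam FROM singer", "SELECT name FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
```

Because the metric compares results rather than query text, two syntactically different queries can both be counted as correct, which is why it is preferred over exact string match for this task.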


SQLCritic: Correcting Text-to-SQL Generation via Clause-wise Critic

Abstract

arXiv:2503.07996v4 Announce Type: replace Abstract: Existing refinement methods in LLM-based Text-to-SQL systems exhibit limited effectiveness. They often introduce new errors during the self-correction process and fail to detect and correct semantic inaccuracies. To address these gaps, we first introduce a clause-wise critique generation task along with a benchmark, SQLCriticBench, which performs fine-grained error localization including both syntax and semantic errors at the clause level. Furthermore, we introduce a variant of DPO for training our SQLCritic model, where the β coefficient is adaptively changed according to the clause-level inconsistencies between the preferred and dispreferred critiques. We also propose an automatic training dataset curation pipeline that annotates clause-wise critiques at scale in a cost-effective way. Experiments demonstrate that the SQLCritic model significantly improves SQL accuracy on the BIRD and Spider datasets, and the results on SQLCriticBench further reveal its superior critique capabilities compared to existing models.

摘要

现有基于大语言模型的文本到SQL系统优化方法效果有限,这些方法在自我修正过程中常引入新错误,且无法有效检测和修正语义错误。为解决这些问题,我们首先提出子句级批评生成任务并建立SQLCriticBench基准,该基准能在子句层面实现细粒度错误定位(包括语法和语义错误)。此外,我们引入一种动态DPO变体训练SQLCritic模型,其中β系数会根据偏好与非偏好批评间的子句级不一致性进行自适应调整。我们还提出自动化训练数据构建流程,以经济高效的方式大规模标注子句级批评。实验表明,SQLCritic模型显著提升了BIRD和Spider数据集上的SQL准确率,在SQLCriticBench上的结果进一步证实其批评能力优于现有模型。
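
For context, the standard DPO objective that this variant builds on is usually written as below, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $y_w$ and $y_l$ the preferred and dispreferred responses to prompt $x$, and $\sigma$ the logistic function. The abstract says only that $\beta$ is adapted to clause-level inconsistencies between the two critiques; the adaptation rule is not given here, so the formula shows the usual fixed-$\beta$ form.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In this form, a larger $\beta$ makes the loss more sensitive to the gap between the two log-ratios, which gives some intuition for why scaling it per example could matter.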


Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Abstract

arXiv:2412.18084v4 Announce Type: replace Abstract: Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework PEIT (Property Enhanced Instruction Tuning) to improve LLMs for molecular-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5 and BioT5 in molecule captioning, demonstrating that the modalities align well between textual descriptions, structures, and biochemical properties. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, proving the scalability of the PEIT framework for various molecular tasks. We release the code, constructed instruction data, and model checkpoints at https://github.com/chenlong164/PEIT.

摘要

大型语言模型(LLMs)已被广泛应用于问答和机器翻译等自然语言处理任务。然而,由于标记数据的缺乏以及生化特性人工标注的困难,其在分子生成任务中的表现仍存在局限,特别是在涉及多重属性约束的任务中。本研究提出了一个两阶段框架PEIT(属性增强指令微调)以提升LLMs在分子相关任务中的性能。第一阶段,我们通过对齐多模态表征来合成指令数据,以文本描述、SMILES分子式和生化特性作为多模态输入,预训练得到PEIT-GEN模型。第二阶段,我们利用合成数据对现有开源LLMs进行微调,所得PEIT-LLM可处理分子描述、基于文本的分子生成、分子属性预测以及我们新提出的多约束分子生成任务。实验结果表明,预训练的PEIT-GEN在分子描述任务上优于MolT5和BioT5,证实了文本描述、分子结构与生化特性间的模态对齐效果。此外,PEIT-LLM在多任务分子生成中展现出显著提升,证明了PEIT框架对不同分子任务的可扩展性。我们在https://github.com/chenlong164/PEIT公开了代码、构建的指令数据及模型检查点。


SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models

Abstract

arXiv:2502.20727v3 Announce Type: replace Abstract: With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.

摘要

随着大型语言模型(LLMs)规模的快速扩张,实现跨多个计算单元的高效分布式推理变得愈发关键。然而,诸如张量并行等主流分布式推理技术带来的通信开销,对实现可扩展性和低延迟构成了重大挑战。为此,我们提出了一种新颖的优化技术——同步点丢弃(SPD),通过选择性忽略注意力输出的同步来降低张量并行中的通信开销。具体而言,我们首先设计了一种支持通过SPD实现无通信执行的模块架构;其次,我们根据不同注意力模块对模型精度的敏感程度,为其应用差异化的SPD策略。所提方法在LLM推理过程中有效缓解了通信瓶颈,同时将精度损失降至最低,为多样化分布式环境提供了可扩展的解决方案:在8个GPU上运行LLaMA2-70B模型时,SPD实现了约20%的整体推理延迟降低,且精度损失小于1%。


VRoPE: Rotary Position Embedding for Video Large Language Models

Abstract

arXiv:2502.11664v2 Announce Type: replace Abstract: Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code will be available at https://github.com/johncaged/VRoPE.

摘要

旋转位置编码(RoPE)在基于文本的大语言模型(LLM)中表现出色,但由于视频帧复杂的时空结构,将其扩展至视频领域仍面临挑战。现有改进方法(如RoPE-3D)尝试分别编码空间和时间维度,但存在两个主要缺陷:注意力分布的位置偏差以及视频-文本转换的干扰。为解决这些问题,我们提出视频旋转位置编码(VRoPE),这是一种专为视频大语言模型设计的新型位置编码方法。具体而言,我们引入更平衡的编码策略以减轻注意力偏差,确保空间关注更均匀分布;同时通过重构位置索引实现视频与文本标记间的平滑过渡。在不同模型上的大量实验表明,VRoPE始终优于现有RoPE变体,在视频理解、时序推理和检索任务中取得显著提升。代码将在https://github.com/johncaged/VRoPE发布。
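
As background for the RoPE variants discussed above, the sketch below implements the standard one-dimensional (text) rotary embedding and checks its key property: the query-key dot product depends only on the relative offset between positions. It is not an implementation of VRoPE or RoPE-3D, whose index layouts the abstract does not specify; the vectors and positions are arbitrary example values.

```python
import math
from typing import List


def rope(vector: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vector)
    if dim % 2:
        raise ValueError("vector length must be even")
    rotated: List[float] = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base^(-i/dim), as in standard RoPE.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return rotated


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


query, key = [0.3, -1.2, 0.8, 0.5], [1.0, 0.4, -0.7, 0.9]
# Same relative offset (4) at two different absolute positions.
score_near_start = dot(rope(query, 7), rope(key, 3))
score_shifted = dot(rope(query, 107), rope(key, 103))
```

The two scores agree up to floating-point error. Extending this to video requires deciding how frame, row, and column indices map to positions, which is the design question the abstract addresses.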


MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

Abstract

arXiv:2502.20808v5 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs' mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.

摘要

多模态大语言模型(MLLMs)在各类数据集的视觉情境数学推理任务中展现出显著潜力。然而现有大多数多模态数学基准仅局限于单一视觉场景,这与现实数学应用中常见的多视觉情境存在明显差异。为填补这一空白,我们提出MV-MATH数据集:一个包含2,009道高质量数学题目的精编集合。每道题目融合了文本与多幅交错排列的图像,素材源自真实K-12教育场景,并配有精细标注。该数据集涵盖选择题、开放题及多步骤问题,包含3个难度层级下的11个学科领域,为评估MLLMs在多视觉情境下的数学推理能力提供了全面而严谨的基准。通过大量实验发现,MLLMs在多视觉数学任务中面临显著挑战,其在MV-MATH上的表现与人类能力存在明显差距。此外,我们分析了不同模型的性能表现与错误模式,从而揭示多视觉环境下MLLMs数学推理能力的内在机制。


Abstract

arXiv:2503.10619v4 Announce Type: replace Abstract: We introduce Tempest, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Tempest expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Tempest reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Tempest achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.

摘要

我们提出Tempest——一种多轮对抗框架,该框架通过树搜索视角模拟大型语言模型(LLM)安全性的渐进式侵蚀过程。与依赖单轮精细设计提示的单次越狱攻击不同,Tempest采用广度优先策略在每轮对话中扩展对抗性提示分支,利用模型对先前响应的部分服从性生成多重对抗提示。通过追踪这些渐进式的策略泄漏并将其重新注入后续查询,Tempest揭示了微小让步如何累积成完全违规的输出。在JailbreakBench数据集上的评估表明,Tempest单次多轮攻击对GPT-3.5-turbo达成100%成功率,对GPT-4达到97%,且所需查询量少于Crescendo或GOAT等基线方法。这种树搜索方法深入展现了模型安全机制在连续对话轮次中的退化过程,凸显了建立鲁棒的多轮测试流程对语言模型的紧迫性。


Beyond A Single AI Cluster: A Survey of Decentralized LLM Training

Abstract

arXiv:2503.11023v2 Announce Type: replace Abstract: The emergence of large language models (LLMs) has revolutionized AI development, yet their resource demands extend beyond a single cluster or even datacenter, limiting accessibility to well-resourced organizations. Decentralized training has emerged as a promising paradigm to leverage dispersed resources across clusters, datacenters and regions, offering the potential to democratize LLM development for broader communities. As the first comprehensive exploration of this emerging field, we present decentralized LLM training as a resource-driven paradigm and categorize existing efforts into community-driven and organizational approaches. We further clarify this through: (1) a comparison with related paradigms, (2) a characterization of decentralized resources, and (3) a taxonomy of recent advancements. We also provide up-to-date case studies and outline future directions to advance research in decentralized LLM training.

摘要

大型语言模型(LLMs)的兴起彻底改变了人工智能的发展,但其资源需求已超出单一集群甚至数据中心的范围,使得只有资源充足的组织能够参与。去中心化训练作为一种新兴范式,能够利用跨集群、数据中心和地区的分散资源,为更广泛的群体提供参与LLM开发的可能性。作为对这一新兴领域的首次全面探索,我们将去中心化LLM训练定义为一种资源驱动范式,并将现有研究分为社区驱动和组织驱动两类。我们通过以下三个方面进一步阐明这一范式:(1)与相关范式的比较,(2)去中心化资源的特征描述,以及(3)对近期进展的分类梳理。此外,我们还提供了最新的案例研究,并展望了未来研究方向,以推动去中心化LLM训练领域的进展。


An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint

Abstract

arXiv:2504.14350v3 Announce Type: replace Abstract: Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remains effective under strict constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between token budgets and actual on-device latency budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g. the optimal choices of either model size or prompt style change under different budgets. These findings offer a timely evaluation of this area and practical guidance for users to deploy LLMs under real-world latency constraints.

摘要

近期研究表明,大型语言模型(LLMs)在测试时扩展方面展现出显著潜力。通过让模型在回答前进行思考,它们能够通过额外的推理计算实现更高准确率。然而,在许多实际应用场景中,模型需在时间约束下运行,即必须在特定输出长度内给出答案。目前尚不清楚不同LLMs的推理能力在严格约束条件下是否及如何保持有效。我们通过深入的实证研究首次探讨了这一问题。具体而言,我们在多种输出长度限制下测试了30个LLMs在常见推理数据集上的表现,并分析了推理准确率与模型类型、模型规模、提示风格等特性之间的相关性。同时,我们还考察了标记预算与实际设备端延迟预算之间的映射关系。研究结果揭示了若干与无约束条件下不同的预算感知LLM推理能力现象,例如模型规模或提示风格的最优选择会随预算变化而改变。这些发现为该领域提供了及时的评估依据,并为用户在现实延迟约束下部署LLMs提供了实用指导。


Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection

Abstract

arXiv:2504.09440v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated strong mathematical reasoning capabilities but remain susceptible to hallucinations, producing plausible yet incorrect statements, especially in theorem proving, symbolic manipulation, and numerical computation. While self-consistency (SC) has been explored as a means to improve factuality in LLMs, existing approaches primarily apply SC to final-answer selection, neglecting the logical consistency of intermediate reasoning steps. In this work, we introduce a structured self-consistency framework designed to enhance the reliability of mathematical reasoning. Our method enforces self-consistency across intermediate steps and final outputs, reducing logical inconsistencies and hallucinations. We evaluate our approach across three core mathematical tasks: theorem proving, symbolic transformation, and numerical computation. Experimental results demonstrate that SC significantly improves proof validity, symbolic reasoning accuracy, and numerical stability while maintaining computational efficiency. Further analysis reveals that structured self-consistency not only enhances problem-solving accuracy but also reduces the variance of model-generated outputs. These findings highlight self-consistency as a robust mechanism for improving mathematical reasoning in LLMs, paving the way for more reliable and interpretable AI-driven mathematics.

摘要

大语言模型(LLMs)已展现出强大的数学推理能力,但在定理证明、符号运算和数值计算等任务中仍易产生幻觉,生成看似合理实则错误的陈述。虽然自洽性(SC)已被探索作为提升LLMs事实准确性的手段,但现有方法主要将SC应用于最终答案选择,而忽视了中间推理步骤的逻辑一致性。本研究提出了一种结构化自洽性框架,旨在增强数学推理的可靠性。该方法通过在中间步骤与最终输出间强制保持自洽性,减少逻辑不一致与幻觉现象。我们在三大核心数学任务(定理证明、符号转换和数值计算)上评估了本方法的性能。实验结果表明,自洽性显著提高了证明有效性、符号推理准确性和数值稳定性,同时保持了计算效率。进一步分析表明,结构化自洽性不仅能提升问题求解精度,还可降低模型输出的方差。这些发现揭示了自洽性作为改进LLMs数学推理的强健机制,为构建更可靠、可解释的AI驱动数学方法铺平了道路。
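
The baseline that the abstract contrasts against, self-consistency applied only to final-answer selection, amounts to a majority vote over several sampled answers. A minimal sketch of that baseline follows; the sample answers are made up, and the structured step-level checking proposed in the paper is not shown because the abstract does not describe its mechanics.

```python
from collections import Counter
from typing import List, Tuple


def majority_answer(sampled_answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples that agree."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer, count / len(sampled_answers)


# Hypothetical final answers from five independently sampled reasoning chains.
samples = ["42", "42", "41", "42", "40"]
answer, agreement = majority_answer(samples)
```

The agreement fraction is often used as a rough confidence signal; it says nothing about whether the intermediate steps of the winning chains are themselves consistent, which is the gap the paper targets.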


Towards Machine-Generated Code for the Resolution of User Intentions

Abstract

arXiv:2504.17531v2 Announce Type: replace Abstract: The growing capabilities of Artificial Intelligence (AI), particularly Large Language Models (LLMs), prompt a reassessment of the interaction mechanisms between users and their devices. Currently, users are required to use a set of high-level applications to achieve their desired results. However, the advent of AI may signal a shift in this regard, as its capabilities have generated novel prospects for user-provided intent resolution through the deployment of model-generated code. This development represents a significant progression in the realm of hybrid workflows, where human and artificial intelligence collaborate to address user intentions, with the former responsible for defining these intentions and the latter for implementing the solutions to address them. In this paper, we investigate the feasibility of generating and executing workflows through code generation that results from prompting an LLM with a concrete user intention, and a simplified application programming interface for a GUI-less operating system. We provide an in-depth analysis and comparison of various user intentions, the resulting code, and its execution. The findings demonstrate the general feasibility of our approach and that the employed LLM, GPT-4o-mini, exhibits remarkable proficiency in the generation of code-oriented workflows in accordance with provided user intentions.

摘要

人工智能(AI),尤其是大语言模型(LLM)的日益强大,促使我们重新审视用户与设备之间的交互机制。当前,用户需要通过一系列高级应用程序来实现预期目标。然而,AI的出现可能标志着这一模式的转变——其能力为基于模型生成代码的用户意图解析开辟了新途径。这一进展代表了混合工作流领域的重大进步:人类负责定义意图,而AI则负责实现解决方案。本文研究了通过代码生成创建并执行工作流的可行性,该代码由具体用户意图驱动的LLM提示词,以及无图形界面操作系统的简化应用程序接口所生成。我们对各类用户意图、生成代码及其执行效果进行了深入分析与比较。研究结果表明:该方法总体具备可行性,且所使用的GPT-4o-mini模型在根据给定用户意图生成代码导向型工作流方面表现出卓越能力。


Uncertainty quantification in fine-tuned LLMs using LoRA ensembles

Abstract

arXiv:2402.12264v2 Announce Type: replace-cross Abstract: Fine-tuning large language models can improve task-specific performance, although a general understanding of what the fine-tuned model has learned, what it has forgotten, and how far its predictions can be trusted is still missing. We derive principled uncertainty quantification for fine-tuned LLMs with posterior approximations using computationally efficient low-rank adaptation ensembles. We analyze three common multiple-choice datasets using low-rank adaptation ensembles based on Mistral-7b, and draw quantitative and qualitative conclusions on their perceived complexity and the balance between retained prior knowledge and domain-specific adaptation during and after fine-tuning. We identify unexpected retention of acquired knowledge during fine-tuning in the overfitting regime.

摘要

尽管对微调后模型所学内容、遗忘机制及其预测可信度仍缺乏普遍认知,但微调大语言模型确实能提升特定任务性能。我们通过计算高效的低秩自适应集成方法,推导出基于后验近似的微调LLMs原则性不确定性量化框架。基于Mistral-7b的低秩自适应集成,我们分析了三个常用多选题数据集,定量与定性地评估了它们在微调过程中及完成后所体现的感知复杂度、先验知识保留与领域自适应间的平衡关系。研究发现,在过拟合状态下,微调过程中会出现意外持续的知识保留现象。
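One common way to turn an ensemble's predictions into uncertainty estimates is the entropy decomposition sketched below: total predictive entropy splits into the average per-member entropy plus a mutual-information term that grows when members disagree. This is a generic illustration, assuming each ensemble member outputs a probability vector over the answer options; the abstract does not state which measures the authors use, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_uncertainty(member_probs):
    """Decompose ensemble uncertainty for one multiple-choice question.

    member_probs: one probability vector per ensemble member.
    Returns (total, expected, mutual_information) where
    total = H[mean prediction], expected = mean of member entropies,
    and mutual_information = total - expected (member disagreement).
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[j] for m in member_probs) / n for j in range(k)]
    total = entropy(mean)
    expected = sum(entropy(m) for m in member_probs) / n
    return total, expected, total - expected


# Members that agree: no disagreement term.
print(ensemble_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
# Members that are individually confident but contradict each other.
print(ensemble_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```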


A Framework for Real-time Safeguarding the Text Generation of Large Language Model

Abstract

arXiv:2404.19048v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. Existing methods have limitations, including the need for training specific control models and proactive intervention during text generation, that lead to quality degradation and increased computational overhead. To mitigate those limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding, rejecting unsafe outputs while allowing valid ones. We introduce a similarity-based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening LLMs only when necessary. We evaluate LLMSafeGuard on detoxification and copyright safeguarding, demonstrating its superiority over SOTA baselines. In detoxification, LLMSafeGuard reduces toxic output by at least 38.6% while preserving linguistic quality. Additionally, its context-wise timing selection cuts inference time by at least 24.2% without compromising effectiveness.

摘要

大型语言模型(LLMs)在自然语言处理(NLP)任务中取得显著进展,但其生成有害内容的倾向也带来了伦理和社会风险。现有方法存在局限性,包括需要训练特定控制模型以及在文本生成过程中进行主动干预,导致质量下降和计算开销增加。为缓解这些局限,我们提出LLMSafeGuard——一种轻量级实时框架,通过将外部验证器集成至解码过程,拒绝不安全输出同时保留有效内容。我们引入基于相似性的验证方法,简化约束条件的引入并无需训练控制模型。此外,LLMSafeGuard采用上下文感知的时机选择策略,仅在必要时干预LLMs。我们在去毒性和版权保护任务上评估LLMSafeGuard,证明其优于当前最先进基线模型。在去毒性任务中,LLMSafeGuard至少降低38.6%的有毒输出且保持语言质量;其上下文感知时机选择策略在不影响效果的前提下,至少减少24.2%的推理时间。
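The similarity-based validation idea can be illustrated with a toy filter that rejects candidate continuations that are too similar to known unsafe examples. This is only a sketch under strong assumptions: bag-of-words cosine similarity stands in for whatever similarity measure LLMSafeGuard actually uses, and the threshold, function names, and example strings are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(candidates, unsafe_examples, threshold=0.5):
    """Keep only candidates whose highest similarity to any unsafe
    example stays below the threshold."""
    kept = []
    for text in candidates:
        score = max((cosine(text, bad) for bad in unsafe_examples), default=0.0)
        if score < threshold:
            kept.append(text)
    return kept


unsafe = ["you are a stupid idiot"]
candidates = ["you are a stupid idiot indeed", "the weather is nice today"]
print(filter_candidates(candidates, unsafe))
```

Because the validator only needs example texts rather than a trained classifier, new constraints can be added by extending the example list, which matches the abstract's point about avoiding control-model training.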


Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis

Abstract

arXiv:2406.12719v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs), already shown to ace various text comprehension tasks, have also remarkably been shown to tackle table comprehension tasks without specific training. Building on earlier studies of LLMs for tabular tasks, we probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA) robustness by testing LLMs, under diverse augmentations and perturbations, on diverse domains: Wikipedia-based WTQ, financial TAT-QA, and scientific SCITAB. Although instruction tuning and larger, newer LLMs deliver stronger, more robust TQA performance, data contamination and reliability issues, especially on WTQ, remain unresolved. Through an in-depth attention analysis, we reveal a strong correlation between perturbation-induced shifts in attention dispersion and the drops in performance, with sensitivity peaking in the model's middle layers. We highlight the need for improved interpretable methodologies to develop more reliable LLMs for table comprehension.

摘要

大型语言模型(LLMs)在各类文本理解任务中已展现出卓越能力,值得注意的是,它们无需专门训练即可处理表格理解任务。基于先前针对表格任务的LLMs研究,我们通过在多领域数据集(基于维基百科的WTQ、金融领域TAT-QA和科学领域SCITAB)上实施多样化增强与扰动测试,探究了上下文学习(ICL)、模型规模、指令微调及领域偏差如何影响表格问答(TQA)的鲁棒性。尽管指令微调和更大规模的新版LLMs能提供更强健的TQA性能,但数据污染和可靠性问题(尤其在WTQ上)仍未解决。通过深入的注意力机制分析,我们发现扰动引起的注意力分布变化与性能下降存在强相关性,且模型中间层的敏感性达到峰值。本研究强调需要开发更具可解释性的方法论,以构建更可靠的表格理解LLMs。


An In-Depth Investigation of Data Collection in LLM App Ecosystems

Abstract

arXiv:2408.13247v2 Announce Type: replace-cross Abstract: LLM app (tool) ecosystems are rapidly evolving to support sophisticated use cases that often require extensive user data collection. Given that LLM apps are developed by third parties and that anecdotal evidence indicates inconsistent enforcement of policies by LLM platforms, sharing user data with these apps presents significant privacy risks. In this paper, we aim to bring transparency to the data practices of LLM app ecosystems. We examine OpenAI's GPT app ecosystem as a case study. We propose an LLM-based framework to analyze the natural language specifications of GPT Actions (custom tools) and assess their data collection practices. Our analysis reveals that Actions collect excessive data across 24 categories and 145 data types, with third-party Actions collecting 6.03% more data on average. We find that several Actions violate OpenAI's policies by collecting sensitive information, such as passwords, which is explicitly prohibited by OpenAI. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted, with only 5.8% of Actions clearly disclosing their data collection practices.

摘要

LLM应用(工具)生态系统正在快速发展,以支持通常需要大量用户数据收集的复杂用例。鉴于LLM应用由第三方开发,且有轶事证据表明LLM平台对政策的执行不一致,与这些应用共享用户数据存在重大隐私风险。本文旨在揭示LLM应用生态系统的数据实践透明度。我们以OpenAI的GPT应用生态系统为案例进行研究,提出了一种基于LLM的框架,用于分析GPT Actions(自定义工具)的自然语言规范并评估其数据收集实践。我们的分析表明,Actions在24个类别和145种数据类型中收集了过量数据,其中第三方Actions平均多收集6.03%的数据。研究发现,部分Actions违反OpenAI政策,收集了密码等敏感信息(此类行为被OpenAI明确禁止)。最后,我们开发了一种基于LLM的隐私政策分析框架,用于自动检查Actions的数据收集行为与其隐私政策披露内容的一致性。测量结果显示,大多数收集数据类型的披露信息存在遗漏,仅有5.8%的Actions明确披露了其数据收集实践。


Fine-tuning Large Language Models for Entity Matching

Abstract

arXiv:2409.08185v2 Announce Type: replace-cross Abstract: Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) the representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B and decrease the performance of GPT-4o-mini.

摘要

生成式大语言模型(LLM)因其出色的零样本性能和对未见实体的泛化能力,成为预训练语言模型在实体匹配任务中的有前景替代方案。现有关于LLM实体匹配的研究主要集中于提示工程和上下文学习。本文探索了微调LLM在实体匹配中的潜力,从两个维度展开分析:1)训练样本的表示形式——通过实验验证在训练集中添加不同类型的LLM生成解释的效果;2)利用LLM进行训练样本的选择与生成。除源数据集上的匹配性能外,我们还研究了微调如何影响模型对同领域其他数据集及跨主题领域的泛化能力。实验表明,微调显著提升了较小模型的性能,但对较大模型的效果参差不齐。微调能提升同领域数据集的泛化性能,却会削弱跨领域迁移能力。研究发现,在训练集中加入结构化解释对四分之三的LLM性能产生积极影响,而提出的样本选择与生成方法仅提升了Llama 3.1 8B的性能,同时降低了GPT-4o-mini的表现。


dMel: Speech Tokenization made Simple

Abstract

arXiv:2407.15835v3 Announce Type: replace-cross Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel encoding and decoding method for high-dimensional tokens using an LM-style transformer architecture. This innovation enables us to develop RichTTS and RichASR, two models sharing the same architecture while achieving comparable or better results than specialized existing methods. Our results demonstrate the effectiveness of dMel in achieving high performance on both speech synthesis and recognition tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.

摘要

大型语言模型通过在海量文本数据上进行自监督预训练,彻底改变了自然语言处理领域。受此成功启发,研究者们探索了多种基于压缩的语音标记化方法,将连续语音信号离散化,使得语言建模技术能够应用于离散标记。然而,音频压缩器会引入额外的复杂性和计算成本,且在处理域外音频信号时往往表现不佳。本研究提出了一种新颖的语音表征方法(dmel),该方法将梅尔滤波器组通道离散化为强度区间,相比现有语音标记化方法,创造了一种更简单却更有效的表征方式。我们的方法在保留音频内容、对域外数据的鲁棒性方面展现出卓越性能,同时提供了无需训练、自然且可流式传输的表征方案。针对对数梅尔频谱图的高维特性,我们提出了一种基于LM风格Transformer架构的高效并行编解码方法。这一创新使我们开发出RichTTS和RichASR两个模型,它们共享相同架构,却在语音合成和识别任务上达到或超越了现有专用方法的性能。实验结果表明,dmel表征在统一框架下能同时实现语音合成与识别任务的高性能,为语音与文本的高效联合建模开辟了新途径。
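The core operation described above, mapping each mel-filterbank channel value to an intensity bin, amounts to scalar quantization and needs no training. The following is a minimal sketch assuming uniform bins over a fixed value range; the range, the number of bins, and the function names are illustrative assumptions rather than the settings used in the paper.

```python
def quantize(frame, lo, hi, num_bins):
    """Map each channel value of one frame to a bin index in [0, num_bins).

    Values outside [lo, hi] are clipped before binning.
    """
    step = (hi - lo) / num_bins
    tokens = []
    for value in frame:
        clipped = min(max(value, lo), hi)
        index = int((clipped - lo) / step)
        tokens.append(min(index, num_bins - 1))  # value == hi falls in the last bin
    return tokens


def dequantize(tokens, lo, hi, num_bins):
    """Reconstruct approximate channel values from bin centers."""
    step = (hi - lo) / num_bins
    return [lo + (t + 0.5) * step for t in tokens]


# One hypothetical frame with five channels, four bins over [0, 4].
frame = [-10.0, 0.0, 3.9, 4.0, 9.0]
tokens = quantize(frame, 0.0, 4.0, 4)
print(tokens, dequantize(tokens, 0.0, 4.0, 4))
```

Each frame becomes a vector of small integers, one per channel, which is why the abstract goes on to discuss parallel encoding and decoding of high-dimensional tokens.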


DPO Meets PPO: Reinforced Token Optimization for RLHF

Abstract

arXiv:2404.18922v4 Announce Type: replace-cross Abstract: In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of large language models, its open-source implementation is still largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm, Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO.

摘要

在经典的基于人类反馈的强化学习(RLHF)框架中,近端策略优化(PPO)被用于从稀疏的句子级奖励中学习——这是传统深度强化学习中的一个具有挑战性的场景。尽管PPO在大型语言模型对齐方面取得了巨大成功,但其开源实现仍存在较大优化空间。针对这些问题,我们提出了一个将RLHF问题建模为马尔可夫决策过程(MDP)的框架,从而能够捕获细粒度的词元级信息。在此框架下,我们提出了一种强化词元优化算法(RTO),该算法从偏好数据中学习词元级奖励函数,并基于学习到的词元级奖励信号进行策略优化。理论上,RTO被证明能够高效地找到接近最优的策略。在实际实现方面,RTO创新性地整合了直接偏好优化(DPO)和PPO。DPO最初源自稀疏句子奖励,却意外地为我们提供了响应质量的词元级表征,这被无缝地整合到后续的PPO训练阶段中。大量实验表明,RTO的表现优于PPO和其他直接偏好学习算法。具体而言,RTO在AlpacaEval 2基准测试中比PPO高出7.5分,在Arena-Hard上高出4.1分。我们的代码和模型已发布于https://github.com/zkshan2002/RTO。
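One way to see why DPO yields a token-wise signal: the sentence-level log-ratio between a DPO-trained policy and the reference policy factorizes over tokens, because both policies are autoregressive. The identity below is a worked illustration of that factorization; the symbols (the DPO policy, the reference policy, and the temperature beta) follow standard DPO notation, and whether RTO adds further terms to its reward is not stated in the abstract.

```latex
\beta \log \frac{\pi_{\mathrm{dpo}}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  = \sum_{t=1}^{T} \beta \log
    \frac{\pi_{\mathrm{dpo}}(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
```

Each summand can be read as a per-token reward for emitting token y_t in state (x, y_{<t}), which gives PPO a dense signal in place of a single reward at the end of the sentence.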


Optimizing Adaptive Attacks against Watermarks for Language Models

Abstract

arXiv:2410.02440v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively optimized paraphrasers at https://github.com/nilslukas/ada-wm-evasion.

摘要

大型语言模型(LLMs)可能被滥用以大规模传播不良内容。内容水印技术通过隐藏信息来威慑滥用行为,借助秘密水印密钥实现内容检测。鲁棒性是其核心安全属性,指逃避检测需以(显著)降低内容质量为代价。尽管已有多种LLM水印方案被提出,但现有鲁棒性测试仅针对非适应性攻击者——这类攻击者缺乏水印方法知识且只能实施次优攻击。我们将水印鲁棒性建模为目标函数,并采用基于偏好的优化方法针对特定水印方案调适应性攻击。评估表明:(i)适应性攻击可逃避所有已调研水印的检测;(ii)针对任意水印的训练均能成功规避未知水印;(iii)基于优化的攻击具有成本效益。这些发现凸显了针对适应性调优攻击进行鲁棒性测试的必要性。我们在https://github.com/nilslukas/ada-wm-evasion发布了适应性优化的复述生成器。


Parameter Efficient Fine-tuning via Explained Variance Adaptation

Abstract

arXiv:2410.07170v4 Announce Type: replace-cross Abstract: Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization strategies for LoRA often rely on singular value decompositions (SVD) of gradients or weight matrices. However, they do not provably maximize the expected gradient signal, which is critical for fast adaptation. To this end, we introduce Explained Variance Adaptation (EVA), an initialization scheme that uses the directions capturing the most activation variance, provably maximizing the expected gradient signal and accelerating fine-tuning. EVA performs incremental SVD on minibatches of activation vectors and selects the right-singular vectors for initialization once they have converged. Further, by selecting the directions that capture the most activation variance for a given rank budget, EVA accommodates adaptive ranks that reduce the number of trainable parameters, while maintaining or improving downstream performance. We apply EVA to a variety of fine-tuning tasks such as language generation and understanding, image classification, and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution.

摘要

基础模型(FMs)通过大规模数据集预训练后,需针对特定下游任务进行微调。最常见的微调方法是采用低秩自适应(LoRA)更新预训练权重。现有LoRA初始化策略通常依赖于梯度或权重矩阵的奇异值分解(SVD),但这些方法无法被证明能最大化预期梯度信号——这对快速适应至关重要。为此,我们提出解释方差自适应(EVA),该初始化方案利用捕获最大激活方差的方向,可证明地最大化预期梯度信号并加速微调。EVA对激活向量的小批量数据执行增量SVD,并在其收敛后选择右奇异向量进行初始化。此外,通过为给定秩预算选择捕获最多激活方差的方向,EVA支持自适应秩配置,在减少可训练参数数量的同时保持或提升下游性能。我们将EVA应用于语言生成与理解、图像分类和强化学习等多种微调任务。实验表明,EVA比同类方法收敛更快,并通过秩重分配减少可训练参数,同时在每个领域的多项任务中取得最高平均得分。


Retrospective Learning from Interactions

Abstract

arXiv:2410.13852v2 Announce Type: replace-cross Abstract: Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection without additional annotations. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct a multimodal LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.

摘要

大型语言模型(LLM)与用户之间的多轮交互天然包含隐式反馈信号。当LLM对指令作出意外响应时,用户通常会通过重述请求、表达沮丧或转向替代任务来传递信号。此类信号与具体任务无关,且处于相对受限的语言子空间中,使得LLM即使在实际任务失败时仍能识别它们。我们提出ReSpect方法,通过回顾历史交互中的此类信号进行学习,无需额外标注。我们将ReSpect部署于新型多模态交互场景中,由人类指导多模态LLM完成具有组合解空间的抽象推理任务。通过数千次人机交互实验,证明ReSpect逐步将任务完成率从31%提升至82%,且全程无需外部标注。


GATEAU: Selecting Influential Samples for Long Context Alignment

Abstract

arXiv:2410.15633v5 Announce Type: replace-cross Abstract: Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model's performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.

摘要

针对大语言模型在超长上下文指令对齐方面的研究尚不充分。现有研究试图通过合成长指令跟随样本来扩大数据规模,因为构建此类数据集对标注者具有较大挑战性。然而,由于缺乏明确的数据质量保障策略,可能导致低质量样本混入并限制模型性能。为此,我们提出GATEAU框架,通过识别富含长程依赖关系的关键样本来解决长上下文对齐这一独特挑战。具体而言,GATEAU从两个核心维度衡量长程依赖性:因长程依赖导致目标响应生成的难度,以及由此类依赖引发的长输入理解难度。综合实验表明,GATEAU能有效识别关键样本,基于所选样本训练的模型展现出更优的指令跟随能力和长上下文理解能力。


Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders

Abstract

arXiv:2410.06981v4 Announce Type: replace-cross Abstract: The Universality Hypothesis in large language models (LLMs) claims that different models converge towards similar concept representations in their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models. Previous works studied if LLMs learned the same features, which are internal representations that activate on specific concepts. Since comparing features across LLMs is challenging due to polysemanticity, in which LLM neurons often correspond to multiple unrelated features rather than to distinct concepts, sparse autoencoders (SAEs) have been employed to disentangle LLM neurons into SAE features corresponding to distinct concepts. In this paper, we introduce a new variation of the universality hypothesis called Analogous Feature Universality: we hypothesize that even if SAEs across different models learn different feature representations, the spaces spanned by SAE features are similar, such that one SAE space is similar to another SAE space under rotation-invariant transformations. Evidence for this hypothesis would imply that interpretability techniques related to latent spaces, such as steering vectors, may be transferred across models via certain transformations. To investigate this hypothesis, we first pair SAE features across different models via activation correlation, and then measure spatial relation similarities between paired features via representational similarity measures, which transform spaces into representations that reveal hidden relational similarities. Our experiments demonstrate high similarities for SAE feature spaces across various LLMs, providing evidence for feature space universality.

摘要

大语言模型(LLMs)的普适性假说认为,不同模型在其潜在空间中的概念表征会趋于相似。验证该假说将使研究者能够利用普适特性,促进机械可解释性技术在跨模型间的推广。先前研究关注LLMs是否学习了相同特征(即针对特定概念激活的内部表征)。由于LLM神经元常对应多个无关特征而非独立概念(多义性),跨模型特征比较具有挑战性,因此采用稀疏自编码器(SAEs)将LLM神经元解耦为对应独立概念的SAE特征。本文提出"类比特征普适性"这一普适性假说的新变体:我们假设即使不同模型的SAEs学习到不同特征表征,SAE特征所张成的空间仍具有相似性,使得一个SAE空间可通过旋转不变变换与另一SAE空间相似。该假说的验证意味着与潜在空间相关的可解释性技术(如导向向量)可能通过特定变换跨模型迁移。为探究该假说,我们首先通过激活相关性跨模型配对SAE特征,随后利用表征相似性度量(将空间转换为能揭示隐藏关系相似性的表征)测量配对特征间的空间关系相似性。实验表明不同LLMs的SAE特征空间具有高度相似性,为特征空间普适性提供了证据。


A Closer Look at Machine Unlearning for Large Language Models

Abstract

arXiv:2410.08109v4 Announce Type: replace-cross Abstract: Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.

摘要

大型语言模型(LLMs)可能记忆敏感或受版权保护的内容,引发隐私和法律问题。由于从头开始重新训练的高成本,研究者尝试采用机器遗忘技术从LLMs中移除特定内容,同时保持整体性能。本文探讨了LLMs机器遗忘中的若干问题,并就可能的方法提出了见解。针对遗忘后模型输出评估不足的问题,我们引入了三个新增指标来评估词汇多样性、句子语义和事实准确性。随后,我们将遗忘方法分为非定向与定向两类,并分别讨论其问题。具体而言,非定向遗忘试图逼近的行为具有不可预测性且可能涉及幻觉,而现有正则化方法对定向遗忘的约束不足。为缓解这些问题,我们提出采用最大化熵(ME)作为非定向遗忘的目标,并引入答案保留(AP)损失作为定向遗忘的正则化项。在虚构遗忘、持续遗忘和真实场景遗忘三种情境下的实验结果验证了我们方法的有效性。代码发布于https://github.com/sail-sg/closer-look-LLM-unlearning。


How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond

Abstract

arXiv:2501.05714v3 Announce Type: replace-cross Abstract: With the advancement of large language models (LLMs), intelligent models have evolved from mere tools to autonomous agents with their own goals and strategies for cooperating with humans. This evolution has birthed a novel paradigm in NLP, i.e., human-model cooperation, that has yielded remarkable progress in numerous NLP tasks in recent years. In this paper, we take the first step to present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges. In particular, we introduce a new taxonomy that provides a unified perspective to summarize existing approaches. Also, we discuss potential frontier areas and their corresponding challenges. We regard our work as an entry point, paving the way for more breakthrough research in this regard.

摘要

随着大语言模型(LLMs)的发展,智能模型已从单纯工具演变为具备自主目标和人类协作策略的自治智能体。这一进化催生了自然语言处理(NLP)领域的新范式——人机协作,该范式近年来在众多NLP任务中取得了显著进展。本文首次系统梳理了人机协作的研究现状,深入探讨其基本原理、形式化框架及开放挑战。特别地,我们提出了一种新的分类法,为现有方法提供统一视角的归纳总结。同时,我们讨论了潜在的前沿领域及其相应挑战。本研究旨在为该领域后续突破性研究提供基础性引导。


ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Abstract

arXiv:2412.10138v2 Announce Type: replace-cross Abstract: Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

摘要

尽管大型语言模型(LLMs)推动了文本到SQL(Text2SQL)技术的显著进步,但当前最先进的方法仍局限于闭源LLMs(如GPT-4)的上下文学习中,这限制了其在开放场景中的适用性。为解决这一挑战,我们提出了一种新颖的鲁棒多任务调优与协作方法(ROUTE),以提升开源LLMs在Text2SQL中的综合能力,从而提供更实用的解决方案。我们的方法首先利用与SQL生成相关的多种合成训练数据进行多任务监督微调(SFT)。与现有基于SFT的Text2SQL方法不同,我们引入了多项附加SFT任务,包括模式链接、噪声校正和续写。通过参与多样化的SQL生成任务,模型增强了对SQL语法的理解,并提高了生成高质量SQL查询的能力。此外,受LLM智能体协作模式的启发,我们提出了一种多任务协作提示(MCP)策略。该策略通过多个SQL相关任务间的协作,减少SQL生成过程中的幻觉现象,从而通过显式的多任务能力最大化提升Text2SQL性能。我们在八个开源LLMs和五个广泛使用的基准测试上进行了大量实验与深入分析。结果表明,我们的方案优于最新的Text2SQL方法,并取得了领先的性能。


Lifelong Knowledge Editing requires Better Regularization

Abstract

arXiv:2502.01636v2 Announce Type: replace-cross Abstract: Knowledge editing is a promising way to improve factuality in large language models, but recent studies have shown significant model degradation during sequential editing. In this paper, we formalize the popular locate-then-edit methods as a two-step fine-tuning process, allowing us to precisely identify the root cause of this degradation. We show that model degradation occurs due to (1) over-optimization of internal activations and (2) continuous norm-growth of edited matrices. To mitigate these issues, we introduce two regularization techniques: (1) Most-Probable Early Stopping (MPES) and (2) explicit Frobenius norm-constraint. We demonstrate that applying these simple yet effective regularization techniques at key points in the editing process can substantially mitigate model degradation. Combining these regularization methods enables scaling locate-then-edit methods to 10,000 edits while reducing editing time by 42-61%. These results show that targeted regularization is essential for lifelong knowledge editing.

摘要

知识编辑是提升大语言模型事实准确性的有效方法,但近期研究表明连续编辑会导致模型性能显著下降。本文通过将流行的"定位-编辑"方法形式化为两阶段微调过程,精确揭示了性能下降的根本原因。我们发现模型退化源于:(1)内部激活的过度优化;(2)编辑矩阵的持续范数增长。为缓解这些问题,我们提出两种正则化技术:(1)最大概率早停法(MPES);(2)显式Frobenius范数约束。实验证明,在编辑过程关键节点应用这些简单而有效的正则化技术能显著减轻模型退化。结合这些正则化方法后,"定位-编辑"方法可扩展至10,000次编辑,同时减少42-61%的编辑时间。这些结果表明定向正则化对于终身知识编辑至关重要。
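The second regularizer targets the continuous norm growth of edited matrices. A simple way to picture a Frobenius norm constraint is a projection that rescales an edit whenever its norm exceeds a budget, sketched below. This is an assumption-laden illustration: the paper may enforce the constraint differently (for example as a penalty term in the objective), and the function names and budget value are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def clip_update(delta, max_norm):
    """Return a copy of the edit `delta`, rescaled so that its
    Frobenius norm does not exceed `max_norm`."""
    norm = frobenius_norm(delta)
    if norm <= max_norm or norm == 0.0:
        return [row[:] for row in delta]
    scale = max_norm / norm
    return [[x * scale for x in row] for row in delta]


delta = [[3.0, 4.0], [0.0, 0.0]]  # Frobenius norm 5
clipped = clip_update(delta, 1.0)
print(clipped, frobenius_norm(clipped))
```

Applied after every edit, a bound of this kind keeps the accumulated change to a weight matrix from growing without limit over thousands of sequential edits.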


Probing Semantic Routing in Large Mixture-of-Expert Models

Abstract

arXiv:2502.10928v2 Announce Type: replace-cross Abstract: In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differentiation through routing behavior. We investigate whether expert routing in large MoE models is influenced by the semantics of the inputs. To test this, we design two controlled experiments. First, we compare activations on sentence pairs with a shared target word used in the same or different senses. Second, we fix context and substitute the target word with semantically similar or dissimilar alternatives. Comparing expert overlap across these conditions reveals clear, statistically significant evidence of semantic routing in large MoE models.

摘要

在过去一年中,大型(>1000亿参数)专家混合模型(MoE)在开放领域变得越来越普遍。尽管其优势常被归结为效率因素,但先前研究也通过路由行为探索了功能分化现象。本研究旨在探究大型MoE模型中的专家路由是否受到输入语义的影响。为此我们设计了两组对照实验:首先比较具有相同目标词(使用相同或不同词义)的句子对的激活情况;其次固定上下文环境,用语义相似或相异词汇替换目标词。通过对比不同条件下的专家重叠度,我们发现了大型MoE模型中存在语义路由的明确且具有统计学意义的证据。
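Comparing expert overlap across conditions requires some measure of how similar two routing decisions are. The sketch below uses Jaccard overlap between the top-k expert sets chosen for the same target word in two contexts; the choice of Jaccard overlap, the function names, and the router scores are illustrative assumptions, since the abstract does not specify the overlap measure.

```python
def top_k_experts(router_scores, k):
    """Indices of the k experts with the highest router scores."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return set(ranked[:k])


def expert_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k expert sets of two tokens."""
    a = top_k_experts(scores_a, k)
    b = top_k_experts(scores_b, k)
    return len(a & b) / len(a | b)


# Hypothetical router scores over four experts for one word in two contexts.
same_sense = expert_overlap([0.1, 0.7, 0.2, 0.0], [0.1, 0.6, 0.3, 0.0], k=2)
diff_sense = expert_overlap([0.1, 0.7, 0.2, 0.0], [0.6, 0.3, 0.05, 0.05], k=2)
print(same_sense, diff_sense)
```

Semantic routing would show up as systematically higher overlap for same-sense pairs than for different-sense pairs, aggregated over many sentence pairs and layers.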


Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Abstract

arXiv:2502.01941v2 Announce Type: replace-cross Abstract: This paper investigates an underexplored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. Although existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive benchmark, KVFundaBench, to systematically evaluate the effects of KV cache compression across diverse fundamental LLM capabilities, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals several key findings: (1) Task-Dependent Degradation; (2) Model-Type Robustness; (3) Prompt Length Vulnerability; (4) Chunk-Level Superiority; (5) Prompt-Gain Sensitivity; (6) Long-Context Generation Sensitivity. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.

摘要

本文研究了大语言模型(LLMs)中一个尚未充分探索的挑战:KV缓存压缩方法对模型基础能力的影响。尽管现有方法在长上下文基准测试中实现了令人印象深刻的压缩比,但其对模型核心能力的影响仍缺乏深入研究。我们提出了一个综合性基准测试KVFundaBench,系统评估KV缓存压缩对LLMs多种基础能力的影响,涵盖世界知识、常识推理、算术推理、代码生成、安全性以及长上下文理解和生成等领域。通过分析我们得出若干关键发现:(1)任务依赖性退化;(2)模型类型鲁棒性;(3)提示长度脆弱性;(4)分块级别优势;(5)提示增益敏感性;(6)长上下文生成敏感性。基于对注意力模式和跨任务压缩性能的分析,我们提出ShotKV这一新型压缩方法,该方法独特地处理预填充和解码阶段,同时保持片段级语义连贯性。实验结果表明,在高压缩比条件下,ShotKV在长上下文生成任务上实现了9%-18%的性能提升。


OceanChat: The Effect of Virtual Conversational AI Agents on Sustainable Attitude and Behavior Change

Abstract

arXiv:2502.02863v2 Announce Type: replace-cross Abstract: Marine ecosystems face unprecedented threats from climate change and plastic pollution, yet traditional environmental education often struggles to translate awareness into sustained behavioral change. This paper presents OceanChat, an interactive system leveraging large language models to create conversational AI agents represented as animated marine creatures -- specifically a beluga whale, a jellyfish, and a seahorse -- designed to promote pro-environmental behavior (PEB) and foster awareness through personalized dialogue. Through a between-subjects experiment (N=900), we compared three conditions: (1) Static Scientific Information, providing conventional environmental education through text and images; (2) Static Character Narrative, featuring first-person storytelling from 3D-rendered marine creatures; and (3) Conversational Character Narrative, enabling real-time dialogue with AI-powered marine characters. Our analysis revealed that the Conversational Character Narrative condition significantly increased behavioral intentions and sustainable choice preferences compared to static approaches. The beluga whale character demonstrated consistently stronger emotional engagement across multiple measures, including perceived anthropomorphism and empathy. However, impacts on deeper measures like climate policy support and psychological distance were limited, highlighting the complexity of shifting entrenched beliefs. Our work extends research on sustainability interfaces facilitating PEB and offers design principles for creating emotionally resonant, context-aware AI characters. By balancing anthropomorphism with species authenticity, OceanChat demonstrates how interactive narratives can bridge the gap between environmental knowledge and real-world behavior change.

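Summary

Marine ecosystems face unprecedented threats from climate change and plastic pollution, yet traditional environmental education often fails to turn awareness into lasting behavior change. This study presents OceanChat, a system that uses large language models to drive conversational AI agents, rendered as three animated marine creatures (a beluga whale, a jellyfish, and a seahorse), to promote pro-environmental behavior and raise environmental awareness through personalized dialogue. In a between-subjects experiment (N=900), we compared three interventions: (1) Static Scientific Information, which delivers conventional environmental education through text and images; (2) Static Character Narrative, which uses first-person storytelling from 3D-rendered marine creatures; and (3) Conversational Character Narrative, which enables real-time dialogue with AI-powered marine creatures. The analysis shows that the Conversational Character Narrative condition significantly outperformed the static interventions on behavioral intentions and sustainable choice preferences. The beluga whale character consistently showed stronger emotional engagement across several measures, including perceived anthropomorphism and empathy. Effects on deeper measures such as climate policy support and psychological distance were limited, however, which underscores how hard it is to shift entrenched beliefs. This work extends research on sustainability interfaces that facilitate pro-environmental behavior and offers design principles for emotionally resonant, context-aware AI characters. By balancing anthropomorphism with species authenticity, OceanChat shows how interactive narrative can bridge the gap between environmental knowledge and real-world behavior change.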


Sparsity May Be All You Need: Sparse Random Parameter Adaptation

Abstract

arXiv:2502.15975v2 Announce Type: replace-cross Abstract: Full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive as models have grown in size. Parameter-Efficient Fine-Tuning (PEFT) methods aim at significantly reducing the computational and memory resources needed for fine-tuning these models by only training on a small number of parameters instead of all model parameters. Currently, the most popular PEFT method is the Low-Rank Adaptation (LoRA), which freezes the parameters of the model to be fine-tuned and introduces a small set of trainable parameters in the form of low-rank matrices. We propose simply reducing the number of trainable parameters by randomly selecting a small proportion of the model parameters to train on. In this paper, we compare the efficiency and performance of our proposed approach with PEFT methods, including LoRA, as well as full parameter fine-tuning.

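Summary

As models keep growing, full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive. Parameter-Efficient Fine-Tuning (PEFT) methods aim to substantially reduce the compute and memory needed for fine-tuning by training only a small number of parameters rather than all of them. The most popular PEFT method today is Low-Rank Adaptation (LoRA), which freezes the parameters of the model being fine-tuned and introduces a small set of trainable parameters in the form of low-rank matrices. We propose a simpler approach: reduce the number of trainable parameters by randomly selecting a small fraction of the model's parameters to train. In this paper, we compare the efficiency and performance of the proposed approach against PEFT methods, including LoRA, and against full-parameter fine-tuning.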
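The random-selection idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the flat parameter list, the plain SGD update, the 1% fraction, and the helper names `make_sparse_mask` and `sparse_sgd_step` are all assumptions chosen for illustration.

```python
import random

def make_sparse_mask(num_params, fraction, seed=0):
    """Pick a fixed random subset of parameter indices to train (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, round(num_params * fraction))
    return set(rng.sample(range(num_params), k))

def sparse_sgd_step(params, grads, trainable, lr=0.1):
    """Apply an SGD update only at the selected indices; every other parameter stays frozen."""
    return [p - lr * g if i in trainable else p
            for i, (p, g) in enumerate(zip(params, grads))]

# Toy model: 1000 parameters, of which 1% are trainable (assumed fraction).
params = [1.0] * 1000
grads = [0.5] * 1000
trainable = make_sparse_mask(len(params), fraction=0.01)
updated = sparse_sgd_step(params, grads, trainable)
changed = sum(1 for old, new in zip(params, updated) if old != new)
print(changed, "of", len(params), "parameters were updated")
```

Unlike LoRA, this sketch adds no new parameters: the trainable set is a subset of the existing weights, so only the chosen indices (and their optimizer state) need to be stored.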


Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization

Abstract

arXiv:2502.11140v2 Announce Type: replace-cross Abstract: Rapid advancements in Large Language Models (LLMs) have accelerated their integration into automated visualization code generation applications. Despite advancements through few-shot prompting and query expansion, existing methods remain limited in handling ambiguous and complex queries, thereby requiring manual intervention. To overcome these limitations, we propose VisPath: a Multi-Path Reasoning and Feedback-Driven Optimization Framework for Visualization Code Generation. VisPath handles underspecified queries through structured, multi-stage processing. It begins by reformulating the user input via Chain-of-Thought (CoT) prompting, which refers to the initial query while generating multiple extended queries in parallel, enabling the LLM to capture diverse interpretations of the user intent. These queries then generate candidate visualization scripts, which are executed to produce diverse images. By assessing the visual quality and correctness of each output, VisPath generates targeted feedback that is aggregated to synthesize an optimal final result. Extensive experiments on widely-used benchmarks including MatPlotBench and the Qwen-Agent Code Interpreter Benchmark show that VisPath outperforms state-of-the-art methods, offering a more reliable solution for AI-driven visualization code generation.

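Summary

Rapid progress in large language models (LLMs) has accelerated their integration into automated visualization code generation applications. Despite advances from few-shot prompting and query expansion, existing methods remain limited when handling ambiguous and complex queries and still require manual intervention. To overcome these limitations, we propose VisPath, a multi-path reasoning and feedback-driven optimization framework for visualization code generation. VisPath handles underspecified queries through structured, multi-stage processing. It first reformulates the user input via Chain-of-Thought (CoT) prompting, referring to the initial query while generating multiple extended queries in parallel, so that the LLM can capture diverse interpretations of the user's intent. These queries are then used to generate candidate visualization scripts, which are executed to produce diverse images. By assessing the visual quality and correctness of each output, VisPath generates targeted feedback and aggregates it to synthesize an optimal final result. Experiments on widely used benchmarks, including MatPlotBench and the Qwen-Agent Code Interpreter Benchmark, show that VisPath outperforms state-of-the-art methods and offers a more reliable solution for AI-driven visualization code generation.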


The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Abstract

arXiv:2502.01081v2 Announce Type: replace-cross Abstract: The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the superior capabilities demonstrated by the o-[n] series, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.

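Summary

OpenAI's o-[n] series of models (such as o1, o3, and o4-mini) marks a significant paradigm shift in large language models towards advanced reasoning. Notably, models such as o3 have shown strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). That benchmark is limited to symbolic patterns, however, whereas humans routinely reason about multimodal scenarios involving both vision and language data. There is therefore an urgent need to investigate advanced reasoning in multimodal tasks. To this end, we track how the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) evolve on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. The results show that the o-[n] series, particularly later versions such as o3 and o4-mini, significantly outperforms the GPT-[n] series and scales strongly in multimodal reasoning. Even so, despite the substantial progress and superior capabilities of the o-[n] series, our study shows that these leading models still face persistent challenges. Difficulties are most evident in tasks that require precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, which points to critical areas for future AGI development. We will continue to track new models in the series and update the results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.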


Memory Is Not the Bottleneck: Cost-Efficient Continual Learning via Weight Space Consolidation

Abstract

arXiv:2502.07274v3 Announce Type: replace-cross Abstract: Continual learning (CL) has traditionally emphasized minimizing exemplar memory usage, assuming that memory is the primary bottleneck. However, in modern computing environments-particularly those involving large foundation models-memory is inexpensive and abundant, while GPU time constitutes the main cost. This paper re-examines CL under a more realistic setting with sufficient exemplar memory, where the system can retain a representative portion of past data. We find that, under this regime, stability improves due to reduced forgetting, but plasticity diminishes as the model becomes biased toward prior tasks and struggles to adapt to new ones. Notably, even simple baselines like naive replay can match or exceed the performance of state-of-the-art methods at a fraction of the computational cost. Building on this insight, we propose a lightweight yet effective method called Weight Space Consolidation, which directly operates in the model's weight space via two core mechanisms: (1) rank-based parameter resets to recover plasticity, and (2) weight averaging to enhance stability. Our approach outperforms strong baselines across class-incremental learning with image classifiers and continual instruction tuning with large language models, while requiring only one-third to one-fourth of the training cost. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world continual learning systems where exemplar memory is no longer the limiting factor.

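Summary

Continual learning (CL) research has traditionally emphasized minimizing exemplar memory usage, on the assumption that memory is the primary bottleneck. In modern computing environments, however, especially those involving large foundation models, memory is cheap and abundant, while GPU time is the main cost. This paper re-examines CL in a more realistic setting with sufficient exemplar memory, where the system can retain a representative portion of past data. We find that in this regime stability improves because forgetting is reduced, but plasticity declines because the model becomes biased toward earlier tasks and struggles to adapt to new ones. Notably, even simple baselines such as naive replay can match or exceed state-of-the-art methods at a fraction of the computational cost. Building on this finding, we propose a lightweight yet effective method, Weight Space Consolidation, which operates directly in the model's weight space through two core mechanisms: (1) rank-based parameter resets to recover plasticity, and (2) weight averaging to enhance stability. The method outperforms strong baselines in both class-incremental learning with image classifiers and continual instruction tuning with large language models, while requiring only one-third to one-fourth of the training cost. These findings challenge long-standing CL assumptions and establish a new cost-efficient baseline for real-world continual learning systems in which exemplar memory is no longer the limiting factor.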


Adaptively profiling models with task elicitation

Abstract

arXiv:2503.01986v2 Announce Type: replace-cross Abstract: Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks -- an order of magnitude more than prior work -- where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.

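Summary

Language model evaluations often fail to capture consequential failure modes, forcing experts to inspect outputs by hand and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks, an order of magnitude more than prior work, on which frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI, and that o3-mini is prone to hallucination when fabrications are repeated in-context.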


BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation

Abstract

arXiv:2502.01697v3 Announce Type: replace-cross Abstract: As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. However, current data generation methods rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models. This reliance can be especially problematic when the curation of high-quality examples is expensive or difficult. In this paper we explore the novel few-shot synthetic data generation setting -- generating a high-quality dataset from a few examples. We show that when working with only a few seed examples, instruction-tuned models used in current synthetic data methods produce insufficient diversity for downstream tasks. In contrast, we show that base models without post-training, largely untapped for synthetic data generation, offer substantially greater output diversity, albeit with lower instruction following abilities. Leveraging this insight, we propose Base-Refine (BARE), a novel two-stage method that combines the diversity of base models with the quality assurance of instruction-tuned models. BARE excels in few-shot synthetic data generation: using only 3 seed examples it generates diverse, high-quality datasets that significantly improve downstream task performance. We show that fine-tuning Llama 3.1 8B with 1,000 BARE-generated samples achieves performance comparable to state-of-the-art similarly sized models on LiveCodeBench tasks. Furthermore, data generated with BARE enables a 101% improvement for a fine-tuned Llama 3.2 1B on GSM8K over data generated by only instruction-models, and an 18.4% improvement for a fine-tuned Llama 3.1 8B over the state-of-the-art RAFT method for RAG data generation.

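Summary

As demand for high-quality training data grows, researchers and developers increasingly generate synthetic data to tune and train LLMs. Current data generation methods, however, rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models, and this reliance becomes a serious problem when curating high-quality examples is expensive or difficult. This paper explores the novel few-shot synthetic data generation setting: generating a high-quality dataset from only a few examples. We find that with only a few seed examples, the instruction-tuned models used in current synthetic data methods produce insufficient diversity for downstream tasks. In contrast, base models without post-training, which remain largely untapped for synthetic data generation, offer substantially greater output diversity, albeit with weaker instruction-following ability. Building on this finding, we propose Base-Refine (BARE), a two-stage method that combines the diversity of base models with the quality assurance of instruction-tuned models. BARE excels at few-shot synthetic data generation: with only 3 seed examples it generates diverse, high-quality datasets that significantly improve downstream task performance. Experiments show that fine-tuning Llama 3.1 8B on 1,000 BARE-generated samples achieves performance comparable to state-of-the-art models of similar size on LiveCodeBench tasks. In addition, compared with data generated by instruction-tuned models alone, BARE-generated data yields a 101% improvement for a fine-tuned Llama 3.2 1B on GSM8K, and for RAG data generation a fine-tuned Llama 3.1 8B achieves an 18.4% improvement over the state-of-the-art RAFT method.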


Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Abstract

arXiv:2502.06809v2 Announce Type: replace-cross Abstract: Interpreting the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Prior work has primarily focused on mapping individual neurons to discrete semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. Through a comprehensive analysis of both encoder and decoder-based LLMs across diverse datasets, we observe that even highly salient neurons, identified via various attribution techniques for specific semantic concepts, consistently exhibit polysemantic behavior. Importantly, activation magnitudes for fine-grained concepts follow distinct, often Gaussian-like distributions with minimal overlap. This observation motivates a shift from neuron attribution to range-based interpretation. We hypothesize that interpreting and manipulating neuron activation ranges would enable more precise interpretability and targeted interventions in LLMs. To validate our hypothesis, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference, while maintaining precise manipulation of targeted concepts, outperforming neuron attribution.

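Summary

Understanding the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Prior work has focused mainly on mapping individual neurons to discrete semantic concepts, but such mappings struggle with the polysemanticity inherent in LLMs, where a single neuron encodes multiple distinct concepts. Through a comprehensive analysis of both encoder-based and decoder-based LLMs across diverse datasets, we find that even highly salient neurons, identified for specific semantic concepts by various attribution techniques, consistently exhibit polysemantic behavior. Importantly, activation magnitudes for fine-grained concepts follow distinct, often Gaussian-like distributions with minimal overlap. This observation motivates a shift from neuron attribution to range-based interpretation. We hypothesize that interpreting and manipulating neuron activation ranges enables more precise interpretability and more targeted interventions. To validate this hypothesis, we introduce NeuronLens, a range-based interpretation and manipulation framework that takes a finer-grained view of neuron activation distributions in order to localize concept attribution within a neuron. Extensive empirical evaluation shows that NeuronLens significantly reduces unintended interference while maintaining precise manipulation of the targeted concepts, outperforming neuron attribution.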
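The contrast between whole-neuron and range-based intervention described above can be sketched as follows. This is only an illustration under stated assumptions, not the NeuronLens algorithm: the range is taken as the mean plus or minus two standard deviations, the intervention is zeroing, and the function names and toy activation values are hypothetical.

```python
from statistics import mean, stdev

def concept_range(activations, width=2.0):
    """Estimate an activation range for one concept as mean +/- width * std (assumed rule)."""
    mu, sigma = mean(activations), stdev(activations)
    return mu - width * sigma, mu + width * sigma

def suppress_in_range(activation, low, high):
    """Zero the neuron only when its activation falls inside the concept's range."""
    return 0.0 if low <= activation <= high else activation

# Toy activations of one neuron on inputs expressing a target concept.
target_concept_activations = [2.9, 3.0, 3.1, 3.05, 2.95]
low, high = concept_range(target_concept_activations)

inside = suppress_in_range(3.02, low, high)   # falls in the target range: suppressed
outside = suppress_in_range(0.5, low, high)   # another concept's activation: untouched
print(inside, outside)
```

Ablating the whole neuron would also zero the second activation, which is the kind of unintended interference a range-based intervention is meant to avoid.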


CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Abstract

arXiv:2502.07316v4 Announce Type: replace-cross Abstract: Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.

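Summary

Reasoning is a fundamental capability of large language models. Prior research has focused mainly on narrow skills such as math or code generation, while improving performance on many other reasoning tasks remains difficult because training data is sparse and fragmented. To address this, we propose CodeI/O, an approach that systematically condenses the diverse reasoning patterns embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs or outputs, given code and test cases, entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives (such as logic flow planning, state-space searching, decision tree traversal, and modular decomposition) while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experiments show that CodeI/O yields consistent improvements across symbolic, scientific, logic, math and numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further improve the CoTs through multi-turn revision, which produces CodeI/O++ and higher performance. Data and models are available at https://github.com/hkust-nlp/CodeIO.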
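The verification step mentioned in the abstract (matching ground-truth outputs, or re-executing the code with a predicted input) can be sketched as below. This is a minimal illustration, not the CodeI/O pipeline: the function names and the toy `count_vowels` task are hypothetical.

```python
def verify_output_prediction(func, given_input, predicted_output):
    """Output prediction: run the code on the given input and compare with the prediction."""
    return func(*given_input) == predicted_output

def verify_input_prediction(func, predicted_input, target_output):
    """Input prediction: any input that reproduces the target output counts as correct."""
    try:
        return func(*predicted_input) == target_output
    except Exception:
        # A predicted input that crashes the code is treated as a failed prediction.
        return False

def count_vowels(text):
    """Toy reference function standing in for a collected code sample."""
    return sum(ch in "aeiou" for ch in text.lower())

output_ok = verify_output_prediction(count_vowels, ("Reasoning",), 4)
input_ok = verify_input_prediction(count_vowels, ("queue",), 4)
input_bad = verify_input_prediction(count_vowels, (42,), 4)
print(output_ok, input_ok, input_bad)
```

Because input prediction is checked by execution rather than string matching, a predicted input that differs from the original one is still accepted as long as it yields the target output.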


Rapid Word Learning Through Meta In-Context Learning

Abstract

arXiv:2502.14791v2 Announce Type: replace-cross Abstract: Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

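Summary

Humans can quickly learn a new word from a few examples and then use it systematically and flexibly in new contexts. Yet the few-shot word-learning abilities of current language models, and methods for improving them, remain underexplored. This study introduces a new method, Meta-training for IN-context learNing Of Words (Minnow). Using a special placeholder token to represent the new word, the method trains language models to generate new usages of that word from a few in-context examples. Repeating this training over many new words develops a general word-learning ability. We find that models trained from scratch with Minnow on human-scale child-directed language show strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Discriminative and generative evaluations further show that fine-tuning pre-trained LLMs with Minnow improves their ability, given one or a few in-context examples, to discriminate between new words, identify the syntactic categories of new words, and generate reasonable new usages and definitions. These findings highlight Minnow's data efficiency and its potential to improve language model performance on word-learning tasks.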


Spontaneous Giving and Calculated Greed in Language Models

Abstract

arXiv:2502.17720v3 Announce Type: replace-cross Abstract: Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of "spontaneous giving and calculated greed." Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address--rather than reinforce--the challenges of collective action.

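Summary

Large language models show strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. It remains unclear, however, whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We study this question using economic games that simulate social dilemmas. We first apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game, then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing models with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with a higher share of reasoning agents show lower collective gains. These behaviors mirror the human pattern of "spontaneous giving and calculated greed." Our findings point to the need for LLM architectures that integrate social intelligence alongside reasoning, to help address, rather than reinforce, the challenges of collective action.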
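For readers unfamiliar with the Public Goods Game named in the abstract, the sketch below shows the standard payoff rule and why it creates a social dilemma. The endowment of 100, the multiplier of 2.0, and the four-player group are illustrative assumptions, not the settings used in the paper.

```python
def public_goods_payoffs(contributions, endowment=100, multiplier=2.0):
    """Standard payoff rule: the pooled contributions are multiplied and split equally."""
    pot = multiplier * sum(contributions)
    share = pot / len(contributions)
    # Each player keeps what they did not contribute, plus an equal share of the pot.
    return [endowment - c + share for c in contributions]

all_cooperate = public_goods_payoffs([100, 100, 100, 100])
one_free_rider = public_goods_payoffs([100, 100, 100, 0])
all_defect = public_goods_payoffs([0, 0, 0, 0])

print(all_cooperate)   # everyone contributes fully
print(one_free_rider)  # the last player contributes nothing
print(all_defect)      # nobody contributes
```

With these assumed values, the free rider earns more than the three contributors, yet the group as a whole earns less than under full cooperation. That tension between individual rationality and collective gain is what the cooperation measurements above probe.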


Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis

Abstract

arXiv:2502.13178v4 Announce Type: replace-cross Abstract: Post-training quantization (PTQ) has been extensively adopted for compressing large language models (LLMs) owing to its efficiency and low resource requirements. However, current research lacks an in-depth analysis of the strengths and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To address these gaps, we provide a novel benchmark for LLM PTQ in this paper. First, to support our benchmark, we propose a comprehensive taxonomy of existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline within each class, covering models of various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba) and modalities (LLaVA1.5 and VILA1.5) on a wide range of evaluation metrics. Through comparative analysis of the results, we summarize the strengths of each PTQ strategy and the model size-bitwidth trade-off with respect to performance. For example, our benchmark reveals that compensation-based techniques demonstrate outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be reexamined. Finally, we claim that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art robustness across settings. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches. We maintain a repository for our benchmark at https://github.com/zjq0455/PTQ_Benchmark.

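Summary

Post-training quantization (PTQ) has been widely adopted for compressing large language models (LLMs) because of its efficiency and low resource requirements. Current research, however, lacks an in-depth analysis of the strengths and applicable scenarios of each PTQ strategy. Existing algorithms also focus mainly on performance and overlook the trade-off among model size, performance, and quantization bitwidth. To address these problems, this paper presents a new benchmark for LLM PTQ. First, to support the benchmark, we propose a comprehensive taxonomy of existing mainstream methods by systematically examining their computational strategies (for example, optimization-based and compensation-based). We then run extensive experiments with the baseline in each class, covering models of different sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba), and modalities (LLaVA1.5 and VILA1.5), using a wide range of evaluation metrics. Through comparative analysis of the results, we summarize the strengths of each PTQ strategy and the model size-bitwidth trade-off with respect to performance. For example, the benchmark shows that compensation-based techniques have outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be reexamined. Finally, we argue that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art robustness across settings. We believe this benchmark will provide useful guidance for LLM deployment and for future research on PTQ methods. The benchmark repository is available at https://github.com/zjq0455/PTQ_Benchmark.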


Scaling Laws for Many-Shot In-Context Learning with Self-Generated Annotations

Abstract

arXiv:2503.03062v2 Announce Type: replace-cross Abstract: The high cost of obtaining high-quality annotated data for in-context learning (ICL) has motivated the development of methods that use self-generated annotations in place of ground-truth labels. While these approaches have shown promising results in few-shot settings, they generally do not scale to many-shot scenarios. In this work, we study ICL with self-generated examples using a framework analogous to traditional semi-supervised learning, consisting of annotation generation, demonstration selection, and in-context inference. Within this framework, we propose a simple baseline that outperforms ground-truth ICL in zero-shot, few-shot, and many-shot settings. Notably, we observe a scaling law with this baseline, where optimal performance is achieved with more than 1,000 demonstrations. To fully exploit the many-shot capabilities of semi-supervised ICL, we introduce IterPSD, an iterative annotation approach that integrates iterative refinement and curriculum pseudo-labeling techniques from semi-supervised learning, yielding up to 6.8% additional gains on classification tasks.

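Summary

The high cost of obtaining high-quality annotated data for in-context learning (ICL) has motivated methods that use self-generated annotations in place of ground-truth labels. These approaches have shown promising results in few-shot settings but generally do not scale to many-shot scenarios. In this work, we study ICL with self-generated examples using a framework analogous to traditional semi-supervised learning, consisting of annotation generation, demonstration selection, and in-context inference. Within this framework, we propose a simple baseline that outperforms ground-truth ICL in zero-shot, few-shot, and many-shot settings. Notably, the baseline exhibits a scaling law, with optimal performance reached at more than 1,000 demonstrations. To fully exploit the many-shot potential of semi-supervised ICL, we introduce IterPSD, an iterative annotation approach that combines iterative refinement and curriculum pseudo-labeling techniques from semi-supervised learning, yielding up to 6.8% additional gains on classification tasks.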


Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Abstract

arXiv:2503.05179v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms--Conceptual Chaining, Chunked Symbolism, and Expert Lexicons--each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 15 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 78% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.

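Summary

Recent advances in large language models (LLMs) have enabled strong reasoning through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving but often produces overly verbose intermediate outputs and therefore higher computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that combines cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed to be flexible and modular and is instantiated with three paradigms (Conceptual Chaining, Chunked Symbolism, and Expert Lexicons), each tailored to a different kind of reasoning task and selected dynamically at test time by a lightweight routing model. Across 15 reasoning datasets spanning multiple domains, languages, and modalities, SoT reduces token usage by up to 78% with minimal accuracy loss. On tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.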


Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Abstract

arXiv:2503.04715v5 Announce Type: replace-cross Abstract: The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed model and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.09% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/

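Summary

The impressive capabilities of large language models (LLMs) across diverse tasks are now well established, but deploying them effectively still requires careful hyperparameter optimization. Through empirical studies involving grid searches over many configurations, we discover universal scaling laws governing these hyperparameters: the optimal learning rate follows a power-law relationship with both model parameter count and data size, while the optimal batch size scales primarily with data size. Our analysis shows that, for a fixed model and data size, the hyperparameter optimization landscape is convex, which implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community; its estimates on the test set are only 0.09% away from the globally optimal LLM performance found by exhaustive search. These laws are remarkably robust to variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work to unify different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and to establish optimal hyperparameter scaling laws across diverse data distributions. The optimization process consumed substantial compute: nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch, using approximately 100 trillion tokens in total. To support reproducibility and further research, we will progressively release all loss measurements and model checkpoints through the designated repository https://step-law.github.io/.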
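The scaling relations stated in the abstract can be written out in a generic form. This is only a sketch: the symbols $N$ (model parameters), $D$ (training tokens), the coefficients $c_{\eta}, c_{B}$ and the exponents $\alpha, \beta, \gamma$ are notation introduced here, and their fitted values are not given in the abstract. Treating the batch-size relation as a power law is likewise an assumption, since the abstract says only that batch size scales primarily with data size.

```latex
\begin{align}
\eta^{*}(N, D) &= c_{\eta}\, N^{\alpha} D^{\beta}, \\
B^{*}(D) &= c_{B}\, D^{\gamma}, \\
\log \eta^{*} &= \log c_{\eta} + \alpha \log N + \beta \log D .
\end{align}
```

The last line shows why such laws are convenient to fit: in log space the optimal learning rate is linear in $\log N$ and $\log D$, so the exponents can be estimated by ordinary linear regression over grid-search results.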


Large Language Models are Powerful Electronic Health Record Encoders

Abstract

arXiv:2502.17403v3 Announce Type: replace-cross Abstract: Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity present significant challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited access to diverse, high-quality datasets, and by inconsistencies in coding standards and clinical practices. In this study, we explore the use of general-purpose Large Language Models (LLMs) to encode EHR into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into markdown-formatted plain text documents by replacing medical codes with natural language descriptions. This enables the use of LLMs and their extensive semantic understanding and generalization capabilities as effective encoders of EHRs without requiring access to private medical training data. We show that LLM-based embeddings can often match or even surpass the performance of a specialized EHR foundation model, CLMBR-T-Base, across 15 diverse clinical tasks from the EHRSHOT benchmark. To demonstrate generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, a population distinct from that used to train CLMBR-T-Base. Notably, one of the tested LLM-based models achieves superior performance for disease onset, hospitalization, and mortality prediction, highlighting robustness to shifts in patient populations. Our findings suggest that repurposed general-purpose LLMs for EHR encoding provide a scalable and generalizable alternative to domain-specific models for clinical prediction.

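Summary

Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity pose major challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. Their development is constrained, however, by limited access to diverse, high-quality datasets and by inconsistencies in coding standards and clinical practice. This study explores using general-purpose large language models (LLMs) to encode EHRs into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into markdown-formatted plain-text documents by replacing medical codes with natural-language descriptions, which lets us use the broad semantic understanding and generalization of LLMs as effective EHR encoders without access to private medical training data. Across 15 diverse clinical tasks from the EHRSHOT benchmark, LLM-based embeddings often match or even surpass the specialized EHR foundation model CLMBR-T-Base. To test generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, a population distinct from the one used to train CLMBR-T-Base. Notably, one of the tested LLM-based models achieves superior performance on disease onset, hospitalization, and mortality prediction, showing robustness to shifts in patient population. Our findings suggest that repurposing general-purpose LLMs for EHR encoding offers a scalable and generalizable alternative to domain-specific models for clinical prediction.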
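The serialization step described in the abstract, turning structured records into markdown text with codes replaced by descriptions, might look like the sketch below. The record schema, the `CODE_DESCRIPTIONS` lookup, and the function names are hypothetical; the paper's actual template is not given in the abstract. The resulting text would then be passed to an LLM embedding model, a step not shown here.

```python
# Hypothetical code-to-description lookup; a real system would draw on a full
# medical ontology. The two ICD-10 entries are only illustrative.
CODE_DESCRIPTIONS = {
    "ICD10/E11.9": "Type 2 diabetes mellitus without complications",
    "ICD10/I10": "Essential (primary) hypertension",
}

def describe(code):
    """Return a natural-language description, falling back to the raw code."""
    return CODE_DESCRIPTIONS.get(code, code)

def ehr_to_markdown(patient):
    """Serialize a structured record (assumed schema) as a markdown document."""
    lines = ["# Patient record", "", f"- Age: {patient['age']}", ""]
    for visit in patient["visits"]:
        lines.append(f"## Visit on {visit['date']}")
        lines.append("")
        for code in visit["codes"]:
            lines.append(f"- {describe(code)}")
        lines.append("")
    return "\n".join(lines).strip()

# Made-up example record used only to exercise the function.
record = {
    "age": 63,
    "visits": [
        {"date": "2021-03-02", "codes": ["ICD10/E11.9", "ICD10/I10"]},
    ],
}
document = ehr_to_markdown(record)
print(document)
```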


Think When You Need: Self-Adaptive Chain-of-Thought Learning

Abstract

arXiv:2504.03234v2 Announce Type: replace-cross Abstract: Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We identify that existing approaches directly penalizing reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness with conciseness. Moreover, we extend our method to fuzzy tasks where ground truth is unavailable. Experiments across multiple reasoning benchmarks demonstrate that our method maintains accuracy while generating significantly more concise explanations, effectively teaching models to "think when needed."

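Summary

Chain of Thought (CoT) reasoning improves language model performance but often leads to inefficient "overthinking" on simple problems. We find that existing approaches that directly penalize reasoning length fail to account for differences in problem complexity. Our approach constructs rewards through comparisons of length and quality, guided by theoretical assumptions that jointly improve solution correctness and conciseness. We further apply the method to fuzzy tasks where no ground truth is available. Across multiple reasoning benchmarks, the method maintains accuracy while generating significantly more concise explanations, effectively teaching models to "think when needed."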


Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning

Abstract

arXiv:2503.06072v2 Announce Type: replace-cross Abstract: The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.

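Summary

The emergence of large language models (LLMs) has fundamentally changed natural language processing, making them indispensable tools in fields ranging from conversational systems to scientific exploration. However, their pre-trained architectures often expose limitations in specialized settings, including restricted reasoning ability, ethical uncertainty, and weak domain-specific performance. These challenges have driven the development of advanced post-training language models (PoLMs), such as OpenAI-o1/o3 and DeepSeek-R1 (collectively called Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: fine-tuning, which improves task-specific accuracy; alignment, which ensures ethical consistency and agreement with human preferences; reasoning, which advances multi-step inference despite difficulties in reward design; efficiency, which optimizes resource use as complexity grows; and integration and adaptation, which extend capabilities across modalities while addressing coherence issues. Moving from ChatGPT's alignment strategies to DeepSeek-R1's reasoning advances, the authors explain how PoLMs use datasets to reduce bias, deepen reasoning ability, and improve domain adaptability. The contributions include a synthesis of PoLM evolution, a taxonomy of techniques and datasets, and a strategic agenda that uses LRMs to improve reasoning ability and domain flexibility. As the first systematic survey of this scope, the work consolidates recent PoLM advances and builds a rigorous framework for future research, supporting LLMs that combine precision, ethical robustness, and versatility in scientific and societal applications.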


BriLLM: Brain-inspired Large Language Model

Abstract

arXiv:2503.11299v3 Announce Type: replace-cross Abstract: This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of "least resistance" along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long n-gram models when the model size is independent of the input and predicted length of the model. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.

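Summary

This paper reports the first brain-inspired large language model (BriLLM). It is a generative language model that is non-Transformer, non-GPT, and not controlled through the input and output ends in the way traditional machine learning models are. The model is based on the Signal Fully-connected flowing (SiFu) definition on a directed graph of the neural network, and every node in the graph of the whole model is interpretable, whereas traditional machine learning models offer only limited interpretability at the input and output ends. In the language model setting, each token is defined as a node in the graph. A randomly shaped or user-defined signal flow travels along paths between nodes according to the principle of "least resistance", and the next token or node to be predicted or generated is the target of that signal flow. As a language model, BriLLM theoretically supports infinitely long n-gram models, because the model size is independent of the input and prediction length. The model's working signal flow offers the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. The authors have released a first Chinese version of BriLLM with a 4000-token vocabulary, 32-dimensional node width, the ability to predict sequences 16 tokens long, and language model prediction performance comparable to GPT-1. More computing resources would help explore the possibilities described above.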
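
The signal-flow description in the BriLLM abstract can be illustrated with a toy sketch. It is not the authors' implementation: the per-edge weight matrices, the `tanh` nonlinearity, the reading of "least resistance" as "the node where the arriving signal has the largest norm", and every name and size below (`VOCAB`, `WIDTH`, `EDGES`, `next_token`, `generate`) are assumptions made for illustration only.

```python
import math
import random

# Hypothetical toy setup: names and sizes are illustrative, not taken from BriLLM.
VOCAB = ["a", "b", "c", "d"]   # each token is a node in the graph
WIDTH = 3                      # node width (the abstract reports 32 for the released model)

rng = random.Random(0)         # fixed seed so the sketch is deterministic


def random_matrix(rows, cols):
    """Build a rows x cols matrix of random weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Assumption: one WIDTH x WIDTH weight matrix per directed edge (src -> dst).
EDGES = {(s, d): random_matrix(WIDTH, WIDTH) for s in VOCAB for d in VOCAB}


def propagate(signal, src, dst):
    """Send the signal along edge src -> dst: matrix-vector product, then tanh."""
    w = EDGES[(src, dst)]
    return [
        math.tanh(sum(w[i][j] * signal[j] for j in range(WIDTH)))
        for i in range(WIDTH)
    ]


def energy(signal):
    """Euclidean norm of a signal vector."""
    return math.sqrt(sum(x * x for x in signal))


def next_token(signal, current):
    """Pick the node where the arriving signal is strongest.

    This is our reading of "least resistance", not a confirmed definition.
    """
    candidates = {dst: propagate(signal, current, dst) for dst in VOCAB}
    best = max(candidates, key=lambda dst: energy(candidates[dst]))
    return best, candidates[best]


def generate(start, steps):
    """Generate tokens by letting the signal flow from node to node."""
    signal = [1.0] * WIDTH     # user-defined initial signal
    tokens = [start]
    for _ in range(steps):
        token, signal = next_token(signal, tokens[-1])
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(generate("a", 5))
```

In this sketch the number of parameters depends only on the vocabulary size and the node width, not on how many tokens are generated, which mirrors the abstract's point that model size is independent of input and prediction length.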